Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
I want to replace each missing value in the first column of my dataframe with the previous value multiplied by a scalar (e.g. 3).
nRowsDf <- nrow(df)
for(i in 1:nRowsDf){
df[i,1] =ifelse(is.na(df[i,1]), lag(df[i,1])+3*lag(df[i,1]), df[i,1])
}
The above code does not give me an error but does not do the job either.
In addition, is there a better way to do this instead of writing a loop?
Update and Data:
Here is an example of data. I want to replace each missing value in the first column of my dataframe with the previous value multiplied by a scalar (e.g. 3). The NA values are in consecutive rows.
df <- mtcars
df[c(2,3,4,5),1] <-NA
IND <- is.na(df[,1])
df[IND,1] <- df[dplyr::lead(IND,1L, F),1] * 3
The last line of the above code does the job one row at a time (I have to run it 4 times to fill the 4 missing rows). How can I do it in one pass for all rows?
reproducible data which YOU should provide:
df <- mtcars
df[c(1,5,8),1] <-NA
code:
IND <- is.na(df[,1])
df[IND,1] <- df[dplyr::lag(IND,1L, F),1] * 3
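Since you use lag, I use lag. You say "previous", though, so you probably want lead: lead(IND) marks the row just before each NA, while lag(IND) marks the row just after it.
What happens if the first value (in the lead case) or the last value (in the lag case) is missing? (This remains a mystery.)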
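For the case in the update, where the NAs sit in consecutive rows, a single pass is possible with a plain loop in base R. This is only a sketch: fill_scaled is a made-up helper name, and it assumes that a leading NA should simply stay NA because there is no previous value to scale.

```r
# Hypothetical helper: replace each NA with the (possibly already filled)
# previous value multiplied by k. A leading NA is left untouched.
fill_scaled <- function(x, k = 3) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1] * k
  }
  x
}

df <- mtcars
df[c(2, 3, 4, 5), 1] <- NA
df[, 1] <- fill_scaled(df[, 1], k = 3)
head(df[, 1])
```

Because each filled value is reused for the next row, a run of NAs grows by a factor of k per row, which is what running the one-line version four times would produce.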
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
I am trying to determine which variables in my V1 column have values in the V5 column that are in the range of 95-105 and also have values in the V6 column that are in the 7-13 range. I am using the which function and attempting to store the names of the variables in V1 under the variable x but I keep getting the output integer(0) or character(0) and I'm not sure what that means. An image of my code is attached below.
integer(0) means there are no elements of your data frame that satisfy the conditions. You could try
with(df, any(95 <= V5 & V5 <= 105 &
             7 <= V6 & V6 <= 13))
(edited on the basis of @H1's comment, to match your description rather than your code; the comparisons are rearranged slightly to approximate the A < B < C syntax that R's parser can't handle).
You should probably check str(df) and/or summary(df) (or sapply(df, class)) to make sure that your data frame has really been read in as intended (or use readr::read_csv(), which prints information about the classes inferred from the data set). In particular, any typo in your data that makes an entry an invalid number (an extra decimal point, a missing value such as "?" that is not recognized as missing, etc.) will make R interpret the entire column as character (since you've set stringsAsFactors=FALSE) rather than as a numeric variable.
If you want to force columns 2-14 to numeric, you can use df[-1] <- lapply(df[-1], as.numeric); however, it would be better practice to find and fix any problems upstream ...
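To make the diagnosis concrete, here is a small sketch with made-up data (the values are invented for illustration, not taken from the question): one "?" entry turns V5 into a character column, and converting it back lets the range test work as described.

```r
# Hypothetical data: V5 is character because of the "?" entry
df <- data.frame(V1 = c("a", "b", "c", "d"),
                 V5 = c("100", "?", "97", "120"),
                 V6 = c(10, 9, 20, 8),
                 stringsAsFactors = FALSE)

sapply(df, class)   # V5 is reported as "character"

# Force V5 to numeric; "?" becomes NA (the coercion warning is expected)
df$V5 <- suppressWarnings(as.numeric(df$V5))

# Range test from the question's description (V5 in 95-105, V6 in 7-13)
in_range <- with(df, 95 <= V5 & V5 <= 105 & 7 <= V6 & V6 <= 13)
x <- df$V1[which(in_range)]   # which() drops the NA produced by "?"
x
```

If x still comes back as character(0) after the columns are numeric, then no row really satisfies both conditions.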
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I am currently working with the data set "mhw.csv" , located at https://datahub.io/nl/dataset/mercer-and-hall-wheat-yield-dat
Which is a data frame pertaining
The data frame is separated into 4 columns:
"r" "c" "wheat" "straw"
Column r is a row number and c is a column number, corresponding to an individual plot in the field. The field is 20 x 25, so the data frame has 500 rows.
I want to divide the data into 4 quadrants: a NorthWest (rows 1:5 and columns 1:12), a NorthEast (rows 1:5 and columns 13:25), a SouthWest (rows 5:10 and columns 1:12), and a SouthEast (rows 5:10 and columns 13:25).
Then I want to add a 5th column to the data.frame that denotes where each plot is located.
Any help would be greatly appreciated. This is my first question, I hope I gave enough information.
Thank you!
I'm not going to go download that data, but using sample data:
test1 <- data.frame(r = sample(1:10, 10), c = sample(1:25, 10))
The simplest no-frills answer is probably:
test1$Quadrant[test1$r<=5 & test1$c<=12] <- "Northwest"
test1$Quadrant[test1$r>5 & test1$c<=12] <- "Southwest"
...
Et cetera. Do it for your four quadrants and the dataframe should now have the new column you're looking for.
PS: Generally you'll get quicker answers if you provide a sample dataframe like I did above with 'test1'.
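If you'd rather not write four separate assignments, one possible shortcut is to build the label from two ifelse() calls. This is just a sketch on the same kind of sample data; ns and ew are made-up names, and the cut points (r <= 5, c <= 12) are taken from the lines above.

```r
set.seed(1)  # only so the sample data is reproducible
test1 <- data.frame(r = sample(1:10, 10), c = sample(1:25, 10))

# North/South from the row number, west/east from the column number
ns <- ifelse(test1$r <= 5, "North", "South")
ew <- ifelse(test1$c <= 12, "west", "east")
test1$Quadrant <- paste0(ns, ew)

test1
```

Every row gets exactly one label this way, so there is no risk of leaving NA in Quadrant for a combination you forgot to assign.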
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
I have an R dataframe with the dimension 32 x 11. For each row I would like to determine the highest value, the second highest, and the third highest value and add these values as extra columns to the initial dataframe (32 x 14). Many thanks in advance!
library(car)
data(mtcars)
mtcars
First, create a function to get the nth highest value for a vector. Then, create a copy of the dataframe, since the second highest value may change as you add more columns. Then apply your function using apply and 1 to operate row-wise. I'm not sure what would happen if there are NAs in the data. I haven't tested it...
Something like this...
nth_highest <- function(x, n)sort(x, decreasing=TRUE)[n]
tmp <- mtcars
mtcars$highest <- apply(tmp, 1, function(x)nth_highest(x,1))
mtcars$second_highest <- apply(tmp, 1, function(x)nth_highest(x,2))
mtcars$third_highest <- apply(tmp, 1, function(x)nth_highest(x,3))
rm(tmp)
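A variant that sorts each row only once, sketched under the assumption that all columns are numeric (as in mtcars): take the first three sorted values per row in a single apply() call and bind them on. The names top3 and result are made up for this example. On the NA question, sort() drops NA values by default, so a row with NAs would be ranked on its remaining values.

```r
# apply() returns a 3 x 32 matrix here, so transpose it to 32 x 3
top3 <- t(apply(mtcars, 1, function(x) unname(sort(x, decreasing = TRUE))[1:3]))
colnames(top3) <- c("highest", "second_highest", "third_highest")

# Bind to the untouched original, so no temporary copy is needed
result <- cbind(mtcars, top3)
dim(result)
```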
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a dataframe of the correlations between 45 variables, and have added the random forest importance value given to each by the 'varImp' function (I ran a random forest training model with this data).
I would like to run through each column, and wherever a variable has a correlation over .8 (in absolute terms), remove either that row variable or that column variable, whichever has the lower 'varImp' importance. I would also like to remove the same variable from the column/row (since it's a correlation matrix, all variables show up in both a row and a column).
For example, roll_belt and max_picth_belt have a correlation of ~.97, and because roll_belt has a value of 3.77 compared to max_picth_belt's 3.16, I would like to delete max_pitch_belt both as a row, and as a column.
Thanks for your help!
I'm sure there must be a more straightforward way. Still, my code does the job.
Assume, we've loaded your dataset into an object called df (I do not include the code to get your data as it is not relevant).
First, it's handy to separate the correlation data from the value column that holds the feature importance. The new object, called test.value, is the 46th column.
test.value <- df$value
df <- df[,-ncol(df)] # remove the last column from the dataset
Now we are ready to start.
The framework. We need to identify the numbers of rows/columns to remove from the dataset. So we will:
go column by column
identify the positions of all correlates bigger than 0.8
compare feature importance one by one in a nested loop
record the row/column numbers that should be removed in an object called remove
finally, remove the chosen rows/columns
The code is:
remove <- c() # a vector to store features to be removed
for (i in seq_len(ncol(df))) {
  coli <- df[, i] # pick up i-th column
  # logical vector of |cor| > 0.8 (absolute value, as asked), skipping the diagonal
  highcori <- abs(coli) > .8 & coli != 1
  # go further only if there are |cor| > 0.8
  if (sum(highcori, na.rm = TRUE) > 0) {
    posi <- which(highcori) # identify positions of |cor| > 0.8
    # compare feature importance one by one
    for (k in seq_along(posi)) {
      remi <- ifelse(test.value[i] > test.value[posi[k]], posi[k], i)
      remove <- c(remove, remi) # store the less valued feature
    }
  }
}
remove <- sort(unique(remove)) # keep only unique entries
df.clean <- df[-remove, -remove] # finally, clean the dataset
That's it.
UPDATE
For those who can provide a better solution, here are the data in an easily readable form, cor.remove.RData
OR
if you prefer dput
dput.df.txt
dput.test.value.txt
I would be interested to see a better way of solving the task.
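One possibly shorter route, sketched on a tiny made-up correlation matrix (cm, imp and the variables a to d are hypothetical, not the data linked above): find all highly correlated pairs at once with which(..., arr.ind = TRUE) and drop the less important member of each pair.

```r
# Hypothetical 4-variable correlation matrix and importance scores
vars <- c("a", "b", "c", "d")
cm <- matrix(c( 1.00, 0.97, 0.10, -0.85,
                0.97, 1.00, 0.20,  0.30,
                0.10, 0.20, 1.00,  0.05,
               -0.85, 0.30, 0.05,  1.00),
             nrow = 4, dimnames = list(vars, vars))
imp <- c(a = 3.77, b = 3.16, c = 2.00, d = 4.10)

# All pairs above the diagonal with |cor| > 0.8
pairs <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

# For each pair, keep the index of the member with the lower importance
to_drop <- unique(ifelse(imp[pairs[, "row"]] < imp[pairs[, "col"]],
                         pairs[, "row"], pairs[, "col"]))

# Negative indexing with an empty vector would select nothing, so guard it
cm_clean <- if (length(to_drop) > 0) cm[-to_drop, -to_drop, drop = FALSE] else cm
cm_clean
```

Like the loop above, this compares every pair independently, so a variable can be dropped even when its only highly correlated partner is dropped as well; a greedy version that re-checks after each removal would behave differently.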
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 6 years ago.
I would like to compute column-wise sums of a matrix over rows that follow a particular sequence. For example, if I have a matrix of 50 rows, the first four rows (1 to 4) will be added column-wise, then rows 2 to 5, then 3 to 6, etc., following that pattern. How can I do this in R?
set.seed(123)
mat <- matrix(sample(100,50*10,replace=TRUE),nrow=50)
n <- nrow(mat)
sapply(1:(n-3), function(i) colSums(mat[i:(i+3),]))
#Update
oddInd <- sapply(1:(n-3), function(i) {ind <-i:(i+3); ind[!!ind%%2] })
evenInd <- sapply(1:(n-3), function(i) {ind <-i:(i+3); ind[!ind%%2] })
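For the plain 4-row window, a loop-free alternative is to difference cumulative sums. This is a sketch, not a replacement for the code above; w, cs and roll are names made up here, and the result has one row per window (the transpose of what the sapply() call returns).

```r
set.seed(123)
mat <- matrix(sample(100, 50 * 10, replace = TRUE), nrow = 50)
n <- nrow(mat)
w <- 4  # window length: rows i to i + 3

# Column-wise cumulative sums with a leading row of zeros: (n + 1) x ncol(mat)
cs <- rbind(0, apply(mat, 2, cumsum))

# Sum of rows i..(i + w - 1) is cs[i + w, ] - cs[i, ]
roll <- cs[(w + 1):(n + 1), , drop = FALSE] - cs[1:(n - w + 1), , drop = FALSE]
dim(roll)  # one row per window, one column per matrix column
```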