I have a dataframe of the correlations between 45 variables, and have added the random forest importance value given to each by the 'varImp' function (I ran a random forest training model with this data).
I would like to run through each column and, wherever a variable has a correlation over .8 (in absolute terms), remove either that row variable or that column variable, whichever has the lower 'varImp' importance. I would also like to remove that variable from both the row and the column (since it's a correlation matrix, every variable shows up as both a row and a column).
For example, roll_belt and max_pitch_belt have a correlation of ~.97, and because roll_belt has an importance of 3.77 compared to max_pitch_belt's 3.16, I would like to delete max_pitch_belt both as a row and as a column.
Thanks for your help!
I'm sure there must be a more straightforward way. Still, my code does the job.
Assume we've loaded your dataset into an object called df (I do not include the code to get your data, as it is not relevant).
First, it's handy to split the data itself from the value column used for testing feature importance. The new object test.value holds the 46th column:
test.value <- df$value
df <- df[,-ncol(df)] # remove the last column from the dataset
Now we are ready to start.
The framework. We need to identify the numbers of rows/columns to remove from the dataset. So we will:
go column by column
identify the positions of all correlates bigger than 0.8
compare feature importance one by one in a nested loop
record the row/column numbers that should be removed in an object called remove
finally, remove the chosen rows/columns
The code is:
remove <- c() # a vector to store features to be removed
for(i in 1:ncol(df)){
  coli <- df[,i] # pick up i-th column
  highcori <- abs(coli) > .8 & coli != 1 # logical vector of |cor| > 0.8, excluding the self-correlation
  # go further only if there are |cor| > 0.8
  if(sum(highcori, na.rm = TRUE) > 0){
    posi <- which(highcori) # identify positions of |cor| > 0.8
    # compare feature importance one by one
    for(k in 1:length(posi)){
      remi <- ifelse(test.value[i] > test.value[posi[k]], posi[k], i)
      remove <- c(remove, remi) # store the less valued feature
    }
  }
}
remove <- sort(unique(remove)) # keep only unique entries
df.clean <- df[-remove,-remove] # finally, clean the dataset
That's it.
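A more compact, loop-free variant of the same idea, for comparison (a sketch only, assuming df holds the 45 x 45 correlation matrix and test.value the importance vector as above):
cor_mat <- as.matrix(df)
# index pairs with |correlation| > 0.8 in the upper triangle, so each pair appears once
high <- which(abs(cor_mat) > 0.8 & row(cor_mat) < col(cor_mat), arr.ind = TRUE)
# for every such pair, mark the member with the lower importance for removal
drop_idx <- unique(ifelse(test.value[high[, 1]] >= test.value[high[, 2]],
                          high[, 2], high[, 1]))
if(length(drop_idx) > 0) df.clean2 <- df[-drop_idx, -drop_idx]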
UPDATE
For those who can provide a better solution, here are the data in an easily readable form: cor.remove.RData
Or, if you prefer dput:
dput.df.txt
dput.test.value.txt
I would be interested to see a better way of solving the task.
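One related built-in worth knowing: the caret package ships findCorrelation(), which prunes a correlation matrix by mean absolute correlation alone. It ignores the importance values, so it is not a drop-in replacement for the task above, but as a rough sketch:
to_drop <- caret::findCorrelation(as.matrix(df), cutoff = 0.8) # column indices to remove
df.clean3 <- df[-to_drop, -to_drop]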
I'm new to R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like sum(A:T) in Excel.
Just to recap, I'm using R, not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as an example this data from a national survey questionnaire.
If you download the .csv file to your working directory:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it helps you give the groups meaningful names (for instance "knowledge", "attitudes"...).
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but "subtotal1" is quoted, because it is the name of a column inside data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
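Coming back to the original question's column names (item.1 up to item.20), a pattern-based selection avoids both typing the 20 names and hard-coding column numbers. A sketch, assuming those are the only columns whose names start with "item.":
item_cols <- grep("^item\\.", names(data), value = TRUE) # all 20 item columns
data$total.score <- rowSums(data[ , item_cols])           # ranges 0 to 60, as expected
data$subscore <- rowSums(data[ , c("item.1", "item.2", "item.3")]) # any subset of items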
I want to replace each missing value in the first column of my dataframe with the previous value multiplied by a scalar (e.g. 3).
nRowsDf <- nrow(df)
for(i in 1:nRowsDf){
df[i,1] =ifelse(is.na(df[i,1]), lag(df[i,1])+3*lag(df[i,1]), df[i,1])
}
The above code does not give me an error but does not do the job either.
In addition, is there a better way to do this instead of writing a loop?
Update and Data:
Here is an example of the data. I want to replace each missing value in the first column of my dataframe with the previous value multiplied by a scalar (e.g. 3). The NA values occur in consecutive rows.
df <- mtcars
df[c(2,3,4,5),1] <-NA
IND <- is.na(df[,1])
df[IND,1] <- df[dplyr::lead(IND,1L, F),1] * 3
The last line of the above code does the job row by row (I should run it 4 times to fill the 4 missing rows). How can I do it once for all rows?
Reproducible data (which you should provide):
df <- mtcars
df[c(1,5,8),1] <-NA
code:
IND <- is.na(df[,1])
df[IND,1] <- df[dplyr::lag(IND,1L, F),1] * 3
Since you used lag, I used lag. But you say "previous", so maybe you want lead instead.
What happens if the first value (in the lead case) or the last value (in the lag case) is missing? That remains a mystery.
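Note that neither one-step version propagates through a run of consecutive NAs (rows 2 to 5 in the update's example). A plain loop does, because each filled value becomes available for the next row; a minimal sketch, assuming the first value of the column is not NA:
df <- mtcars
df[c(2, 3, 4, 5), 1] <- NA
for (i in 2:nrow(df)) {
  if (is.na(df[i, 1])) df[i, 1] <- 3 * df[i - 1, 1] # previous (already filled) value times 3
}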
For example, I have a data.frame with 40 rows and 20 columns and want to create a variable for each row, named after that row (a string):
row_name_1 <- df[1, ]
Is there a way to write a loop to do this for all rows, to save the trouble of typing 40 lines of code?
I have tried using this code:
Phoneme_Features.list <- setNames(split(Phoneme_Features,
seq(nrow(Phoneme_Features))), rownames(Phoneme_Features))
The specific application for this would be to be able to search another data frame based on the values from the first data frame.
I have 2 data frames: Phoneme_Features and Phonetic_Dictionary (with 130,000 rows). Phoneme_Features is a data frame where each row corresponds to around 20 phonetic features (e.g. for F: consonant = 1, vowel = 0, labial = 1, dental = 1, etc.). Phonetic_Dictionary contains 130,000 words with their corresponding phonetic transcriptions (e.g. phonetics F AH0 N EH1 T IH0 K S).
I want to use the new variables to replace the values of another data frame (stored as factors) so that I can search items in the second data frame by the features in the first data frame (Phoneme Features).
I would like to be able to search Phonetic_Dictionary and return every entry in which the first column contains a value of 1 for consonant. In other words, to be able to search the dictionary for all entries with an initial consonant, or final high vowel, or any other feature from the first data frame Phoneme_Features.
You can use assign() and paste0() to create variable names programmatically.
An example using the iris dataset:
for(i in 1:nrow(iris)){
assign(paste0('row_name_',i),iris[i,])
}
paste0() attaches the row number, i, to the string row_name_, and assign() then creates the newly named variable in the environment with the value iris[i,].
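That said, creating one object per row quickly clutters the global environment; keeping the rows in a single named list (which is essentially what the setNames(split(...)) call in the question builds) is usually easier to work with. A sketch with the same iris example:
row_list <- setNames(split(iris, seq_len(nrow(iris))), rownames(iris))
row_list[["1"]]  # the first row, accessed by name
row_list[[1]]    # or by position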
Thanks for everybody's help. I was able to get what I wanted by using:
for(i in 1:nrow(Phoneme_Features)){
  assign(paste0(Phoneme_Features[i, ]), Phoneme_Features[i, ])
}
I have a data frame df with multiple factor columns, say column A with factors a,b,c, column B with factors m, f and so on.
Each of these columns have NA's.
How can I fill the NA's with a, b, c and m, f according to their distribution in the column? For example, if I have 50% males and 50% females (for simplicity), I will fill my NA's 50% as males and 50% as females.
Is this a good technique if I have around 550 observations and 41 columns?
The next step will be to resample the data to make the data set bigger (10,000 observations or more) and apply ML to it. Please tell me which function can enlarge the data set this way.
Thanks in advance!
You could use the following code (please see a couple of comments below). I created a small data frame to give you a concrete example:
A_ <- c(rep("a", 10), rep("b", 60), rep("c", 30), rep(NA, 200)) # 100 observed values, 200 NA
A <- data.frame(A_)
names(A) <- c("A")
# sample replacements with the same proportions as the observed values (10/60/30)
b <- sample(c("a","b","c"), size = 200, prob = c(10,60,30)/100, replace = TRUE)
A[is.na(A)] <- b
And you can check with
table(A)
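If you do decide imputation is appropriate, the same idea generalizes to every factor (or character) column at once by sampling from each column's observed values, which automatically reproduces the observed proportions. A sketch, assuming your data frame is called df:
fill_from_observed <- function(x) {
  if (is.factor(x) || is.character(x)) {
    miss <- is.na(x)
    if (any(miss) && any(!miss)) {
      x[miss] <- sample(x[!miss], sum(miss), replace = TRUE) # draw from observed values
    }
  }
  x
}
df[] <- lapply(df, fill_from_observed)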
Now you should be careful about changing the NA values. First of all, I would check why you have NAs; maybe there is a reason why the information is not there. By replacing the NAs with values drawn from a distribution, you automatically assume that the missing data follow the same distribution as the observed data. Is that really so? 550 observations are really not that many for estimating a distribution. Maybe you simply need to ignore the records with NAs?
Regarding your second question: you cannot simply generate new data from your existing data. In some cases (images, for example, which you can tilt, shift and so on) you can "augment" your data set, but with a data set as small as the one you describe I would not do it.
It all depends on the kind of data you have, but my first impression is that in your case, with 41 columns and many NAs, you cannot simply augment your data.
Knowing a bit more about your dataset could help us give you more precise advice.
I am a new R user and an inexperienced coder and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50,000 rows. I want to generate and store, for every firm, a (class x year) matrix with class counts as the elements. Each matrix would be automatically named something like firm.name and stored so that I can use it afterwards for computations. Ideally, I'd also be able to change the simple class counts into a function of the values in columns 4 and 5 (backward and forward citations).
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class, year, firm), as these columns have the same length. However, I don't know how to store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assigment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)
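To keep the firm-by-firm matrices reusable, a named list is handier than one object per firm, and xtabs() gives the weighted version (sums of a citation column instead of simple counts). A sketch, assuming your data frame is called firmdata with columns firm, year, class and a hypothetical numeric column fwd.citations:
counts <- with(firmdata, table(class, year, firm)) # class x year x firm counts
per_firm <- setNames(
  lapply(dimnames(counts)$firm, function(f) counts[ , , f]),
  dimnames(counts)$firm
)
per_firm[["SomeFirm"]] # the class x year count matrix for one (hypothetical) firm

# sum forward citations per class/year/firm instead of counting rows
cites <- xtabs(fwd.citations ~ class + year + firm, data = firmdata)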