Populate cell with the column name of the max value in corresponding row - r

I am practicing my R programming skills using Kaggle data sets, and I could use some help. I am working on the Ghosts, Ghouls, and Goblins data set and the goal is to predict which type of monster each row represents based on a set of descriptive stats. I trained a multinomial logistic regression model using a training data set to get probability values for each of the 3 types, and now I just want to put the name of the monster in the last cell of each row in the test data set based on the max probability from 3 columns in that row. Here is the head of my table: predProbs Table
What I have currently tried seems to populate every cell in the type column with the same value. How can I calculate the max probability within the columns "Ghost", "Ghoul", and "Goblin", get the column name of the column containing the max value, and then populate the last cell in every row (column name: type) with the name? I want to do this for every row in the test data set. This is what I am currently trying to do and then just cbind typesList with the whole list called predProbs.
for (i in nrow(predProbs)) {typesList = append(typesList, which.max(apply(predProbs[i,7:9], MARGIN = 2, max)))}
But this doesn't seem to be creating the vector that I need. Any thoughts?
This is similar to this post: find max value in a row and update new column with the max column name
But, unfortunately, I'm not very fluent in SQL yet so I'm not able to translate it to R.
Any help would be greatly appreciated. Thanks!
-Wes

You should think of something like this:
t(apply(predProbs,1,function(i)append(i,names(predProbs)[which.max(i)],length(i))))
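A minimal base R sketch of one way to do this, assuming the three probability columns are named Ghost, Ghoul and Goblin as in the question. The small predProbs data frame below is made-up illustration data, not the Kaggle table. Note that for (i in nrow(predProbs)) in the question runs only once, because nrow() returns a single number; seq_len(nrow(predProbs)) would visit every row. max.col() avoids the loop entirely and looks only at the probability columns:

```r
# Made-up stand-in for the real predProbs table (assumption for illustration)
predProbs <- data.frame(
  id     = 1:3,
  Ghost  = c(0.7, 0.2, 0.1),
  Ghoul  = c(0.2, 0.5, 0.3),
  Goblin = c(0.1, 0.3, 0.6)
)

# Restrict the search to the probability columns only
probCols <- c("Ghost", "Ghoul", "Goblin")

# max.col() returns, for each row, the position of the largest value;
# ties.method = "first" makes ties deterministic
best <- max.col(as.matrix(predProbs[, probCols]), ties.method = "first")

# Translate positions back into column names and store them in 'type'
predProbs$type <- probCols[best]
print(predProbs)
```

Selecting the columns by name rather than by position (7:9) keeps the code working if the column order changes.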

Related

Using group_by to determine median of second column after sorting/filtering for a specific value in first column?

I have a huge dataset which has been difficult to work with.
I want to find the median of a second column but only based on one value in the first column. I have used this formula to find general medians without specifying/sorting by the specific values in the first column:
df %>% group_by(column1) %>% summarise(Median = median(column2))
However, there is a specific value in column1 I am hoping to sort by and I only want the medians of the second column based on this first value. Would I do something similar to the below?
df %>% group_by(column1, specificvalue) %>% summarise(Median = median(column2))
Is there an easier way to do this? Would it be easier to make a new dataframe based on the specific value in the first column? How would that be done so that I could have column 1 only include the specific value I want but the rest of the rows included so I can easily determine the median of column2?
Thanks!!
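One possible approach, sketched in base R under the assumption that the goal is the median of column2 for a single value of column1: subset the rows first, then take the median. The data frame and the target value "A" are hypothetical placeholders.

```r
# Hypothetical data; column names follow the question
df <- data.frame(
  column1 = c("A", "B", "A", "B", "A"),
  column2 = c(10, 1, 30, 3, 20)
)

# The specific value of column1 we care about (placeholder)
target <- "A"

# Keep only the matching rows; all other columns are preserved
subset_df <- df[df$column1 == target, ]

# Median of column2 within that subset
med <- median(subset_df$column2)
print(med)

# With dplyr loaded, the equivalent pipeline would be:
# df %>% filter(column1 == target) %>% summarise(Median = median(column2))
```

Filtering before summarising avoids computing medians for groups that are not needed, and subset_df doubles as the "new data frame" asked about.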

How to choose a specific data value by row or column index in SAS?

I usually use R, but I have just started studying SAS.
In R, we can make a data.frame like this:
df <- as.data.frame(matrix(c(1:6),nrow=2,ncol=3))
and then
df[1,2]
is 3.
Here is my question. How can I use row and column index in SAS?
I couldn't find this.
I want to use the row and column numbers as the indices of a double loop.
If the row number and column number have meaning, then you probably do not want to store your "matrix" in that form. Instead, you probably want to store it in a tall format, where the row and column values are stored in variables and the cell values of your matrix are stored in another variable. Since you didn't provide any meaning for your example, let's just name these variables ROW, COL and VALUE.
data have;
  do col=1 to 3;
    do row=1 to 2;
      value+1;
      output;
    end;
  end;
run;
Now if you want to find the value when ROW=1 and COL=2 it is a simple WHERE condition.
proc print data=have;
where row=1 and col=2;
run;
Result:
Obs col row value
3 2 1 3
In a real dataset the ROW might be the individual case or person ID and the COL might be the YEAR or VISIT_NUMBER or SAMPLE_NUMBER of the value.
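For readers coming from R, the same tall layout can be sketched there too. This is only an illustration of the idea, assuming the 2-by-3 matrix from the question; the names have, row, col and value mirror the SAS step above.

```r
# The matrix from the question, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# Tall format: one record per cell, with explicit row and col variables
have <- data.frame(
  col   = as.vector(col(m)),  # column index of every cell
  row   = as.vector(row(m)),  # row index of every cell
  value = as.vector(m)        # the cell contents
)

# Equivalent of the WHERE condition row=1 and col=2
print(have[have$row == 1 & have$col == 2, ])
```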
You access columns by name and rows by _n_ if you really need to, but there is rarely a good use for this type of logic.
For example, suppose you wanted the third row of the Age variable from the SASHELP.CLASS data set.
Note that you need to know the name of the variable; you cannot rely on its index/position.
This displays the information:
proc print data=sashelp.class(firstobs=3 obs=3);
  var Age;
run;
This puts it in a data set, two different ways:
data want;
  set sashelp.class(firstobs=3 obs=3);
  keep age;
run;
data want;
  set sashelp.class;
  if _n_ = 3; *keeps only the third row in the data set;
run;
R and Python load all data into memory and process it there by default. SAS instead reads one row at a time and loops through each row within a data step, so you have to think about each step differently. Basically, break your process into smaller steps and it will work. SAS does have some really nice built-in functionality, like confidence intervals by default or the ability to aggregate data at multiple levels within a single procedure.

How to code a numeric field in r by a set of labels

I have a large data frame with around 190000 rows. The data frame has a label column storing 12 nominal categories. I want to change the weight column value of each row based on the label value of that row. For example, if the label of a row is "Res", I want to change its weight field value to 0.5 and if it is "Condo", I want to change its weight value to 2.
I know it is easy to implement this with an if-else statement, but given the number of rows, the processing takes far too long. I wanted to use cut(), but it seems that cut() categorizes by intervals, not nominal categories. I would appreciate any suggestion that decreases the processing time.
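One common vectorised pattern for this kind of recoding is a named lookup vector, sketched below in base R. The data frame, the extra label "Office" and the name newWeights are hypothetical; only "Res" = 0.5 and "Condo" = 2 come from the question. Rows whose label is not in the lookup keep their existing weight.

```r
# Hypothetical data with a label column and an existing weight column
df <- data.frame(
  label  = c("Res", "Condo", "Office", "Res"),
  weight = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Lookup table: label -> new weight (extend to all 12 categories as needed)
newWeights <- c(Res = 0.5, Condo = 2)

# Position of each row's label in the lookup (NA if the label is not listed)
idx <- match(as.character(df$label), names(newWeights))
hit <- !is.na(idx)

# Overwrite the weights in one vectorised assignment, no row-by-row loop
df$weight[hit] <- newWeights[idx[hit]]
print(df)
```

Because match() and the assignment both operate on whole vectors, this should scale to a few hundred thousand rows far better than looping over rows with if/else.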

Editing a randomly sampled subset of an indexed subset in R?

I have a question about indexing and editing data structures in R. For instance, suppose I have a data frame myDF:
myDF=data.frame(a=rep(c(1,2),10), b=rep(0,20), c=rep(0,20), d=rep(0,20))
I know that I can use column a to index other columns and edit them like this:
myDF$b[myDF$a==1]=3
And I know I can use sample() to get 5 cells at random from a column and edit them like this:
myDF$c[sample(1:20,5)]=6
But how can I select a specific number of cells at random from among those selected based on another column, for editing purposes? E.g. what if I want to set the value of 5 random cells from d to 4 with the constraint that all of these cells also be from rows in which a==1?
You can combine sample and subsetting like this:
myDF$d[sample(which(myDF$a==1),5)]<-4
which() selects the rows that fit the condition, sample() then picks five of them, and you update the d value of those rows.
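A slightly more defensive sketch of the same idea, for the case where the condition might match only one row. When sample() is given a single number n, it samples from 1:n rather than returning n, so indexing the candidate vector through sample.int() is a safer habit. The names candidates and picked are illustrative only.

```r
set.seed(42)  # only to make the illustration reproducible

myDF <- data.frame(a = rep(c(1, 2), 10), b = 0, c = 0, d = 0)

# Row numbers that satisfy the condition
candidates <- which(myDF$a == 1)

# Draw 5 positions within 'candidates', then map back to row numbers
picked <- candidates[sample.int(length(candidates), 5)]

# Edit only the sampled rows
myDF$d[picked] <- 4
print(myDF[picked, ])
```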

Conditional operation on two data frames (R)

I'm having some difficulty executing a conditional operation on two dataframes. For problem illustration, I have three variables: Price, State, and Item, which are stored in a data frame (data1) with those column names. I use ddply to generate a dataframe (data2) that includes columns State and Item, and the average price(or some other function) for that State/Item combination.
What I then want to do is fill in a column in the originating data frame (i.e. a simple prediction vector), where the column's value is the mean value for a given observation's combination of State and Item in data1. (E.g., if an observation in data1 has state="Arizona" and item="pen", I then want to retrieve the average price stored in data2 that corresponds to that state/item combination, and insert it into the column.)
Thank you for any help.
The plyr package comes with a great little function called join. You can use this to complete your task.
join(data1, data2, by=c('State','Item'))
Review ?join to see the different types of joins possible. I'm pretty sure you want a left join.
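For comparison, here is a base R sketch of the same left join, plus a shortcut that skips the second data frame. The sample rows and the column name MeanPrice are assumptions made for illustration; only State, Item and Price come from the question.

```r
# Hypothetical data following the question's column names
data1 <- data.frame(
  State = c("Arizona", "Arizona", "Ohio", "Ohio"),
  Item  = c("pen", "pen", "pen", "ink"),
  Price = c(1, 3, 2, 5)
)

# Per-group averages, playing the role of data2 from the question
data2 <- aggregate(Price ~ State + Item, data = data1, FUN = mean)
names(data2)[names(data2) == "Price"] <- "MeanPrice"

# Left join: every row of data1 is kept (note that merge() may reorder rows)
merged <- merge(data1, data2, by = c("State", "Item"), all.x = TRUE)
print(merged)

# Shortcut: ave() returns the group mean aligned with the original row order
data1$MeanPrice <- ave(data1$Price, data1$State, data1$Item, FUN = mean)
print(data1)
```

The ave() variant keeps the rows of data1 in their original order, which can matter if the new column is used as a prediction vector alongside other vectors.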
