How to code a numeric field in R by a set of labels - r

I have a large data frame with around 190,000 rows. It has a label column storing 12 nominal categories. I want to change the weight column value of each row based on that row's label. For example, if the label of a row is "Res", I want to set its weight to 0.5, and if it is "Condo", I want to set its weight to 2.
I know this is easy to implement with if/else statements, but given the number of rows, the processing takes far too long. I wanted to use cut(), but it seems that cut() categorizes by intervals, not nominal categories. I would appreciate any suggestion that can decrease the processing time.
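One vectorized option is a named lookup vector indexed by the label column, which avoids looping over rows. The following is only a sketch under assumptions: the toy data frame and the "Other" label are made up, only the "Res" and "Condo" weights come from the question, and rows whose label is not in the lookup are left unchanged.

```r
# Toy data standing in for the real 190,000-row data frame (hypothetical values).
df <- data.frame(
  label  = c("Res", "Condo", "Other", "Res"),
  weight = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are labels, values are the new weights.
# Only "Res" and "Condo" come from the question; extend it to all 12 labels.
new_weights <- c(Res = 0.5, Condo = 2)

# Index the lookup by label; labels missing from the lookup give NA.
looked_up <- unname(new_weights[as.character(df$label)])

# Overwrite only the rows whose label was found, leaving the rest unchanged.
found <- !is.na(looked_up)
df$weight[found] <- looked_up[found]
```

The as.character() call matters if label is a factor, because indexing by a factor would use its integer codes rather than its names.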

Related

How to remove rows in a data set according to if values exceed a given number in a particular column in Rstudio

I am trying to remove some outliers from my data set. I am investigating each variable in the data one at a time. I have constructed boxplots for the variables, but I don't want to remove all the classified outliers, only the most extreme. So I am noting the value on the boxplot that I don't want my variable to exceed and trying to remove the rows whose value in that column exceeds the chosen value.
For example,
My data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:
milk <- milk[milk$alpha_s1_casein < 29,]
In fact it did. The number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns of my data set.
Before I ran the above code, the number of NAs was
sum(is.na(milk))
5909 NA values
But after performing the above the sum of NA's now returned is
sum(is.na(milk))
75912 NA values.
I don't understand what is going wrong here and why what I'm doing is introducing more NA values than when I started when all I'm trying to do is remove observations if a column value exceeds a certain number.
Can anyone help? I'm desperate
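The extra NAs most likely come from missing values in alpha_s1_casein itself: wherever the comparison evaluates to NA, logical indexing returns a row filled entirely with NA. Without using additional packages, you can remove all rows where alpha_s1_casein is greater than 29 by indexing with which(), which ignores the NA comparisons: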
milk <- milk[-which(milk$alpha_s1_casein > 29),]
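To make the NA behaviour concrete, here is a small sketch on made-up data (the values and the "other" column are hypothetical, not taken from the real milk data set). It assumes rows with a missing alpha_s1_casein should be kept, which matches what the which() version does.

```r
# Toy stand-in for the milk data (hypothetical values), with one missing measurement.
milk <- data.frame(
  alpha_s1_casein = c(12, 31, NA, 28),
  other           = c("a", "b", "c", "d")
)

# A logical index containing NA returns an all-NA row for each NA.
bad <- milk[milk$alpha_s1_casein < 29, ]
sum(is.na(bad))   # larger than sum(is.na(milk))

# Keep rows that are missing or at most 29; no new NAs are introduced.
keep <- is.na(milk$alpha_s1_casein) | milk$alpha_s1_casein <= 29
milk_clean <- milk[keep, ]
```

An explicit logical mask like keep also sidesteps a pitfall of the -which() form: if no row exceeds the cutoff, which() returns an empty vector and the negative index then selects zero rows instead of all of them.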

Populate cell with the column name of the max value in corresponding row

I am practicing my R programming skills using Kaggle data sets, and I could use some help. I am working on the Ghosts, Ghouls, and Goblins data set, and the goal is to predict which type of monster each row represents based on a set of descriptive stats. I trained a multinomial logistic regression model on a training data set to get probability values for each of the 3 types, and now I just want to put the name of the monster in the last cell of each row in the test data set, based on the max probability from 3 columns in that row. Here is the head of my table: predProbs Table
What I have currently tried seems to populate every cell in the type column with the same value. How can I calculate the max probability within the columns "Ghost", "Ghoul", and "Goblin", get the column name of the column containing the max value, and then populate the last cell in every row (column name: type) with the name? I want to do this for every row in the test data set. This is what I am currently trying to do and then just cbind typesList with the whole list called predProbs.
for (i in nrow(predProbs)) {typesList = append(typesList, which.max(apply(predProbs[i,7:9], MARGIN = 2, max)))}
But this doesn't seem to be creating the vector that I need. Any thoughts?
This is similar to this post: find max value in a row and update new column with the max column name
But, unfortunately, I'm not very fluent in SQL yet so I'm not able to translate it to R.
Any help would be greatly appreciated. Thanks!
-Wes
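You could try something like this, restricted to the three probability columns (7 to 9 in your loop) so that which.max() only compares probabilities. Note that it returns a character matrix, not a data frame:
t(apply(predProbs[, 7:9], 1, function(i) append(i, names(i)[which.max(i)], length(i))))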
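An alternative that keeps the data frame intact is base R's max.col(), which returns the index of the largest column for each row. This is a sketch on made-up probabilities; the id column and the values are hypothetical, and only the column names "Ghost", "Ghoul", "Goblin" and "type" come from the question.

```r
# Toy stand-in for predProbs (hypothetical values); the real table keeps
# the probabilities in columns 7 to 9.
predProbs <- data.frame(
  id     = 1:3,
  Ghost  = c(0.7, 0.2, 0.1),
  Ghoul  = c(0.2, 0.5, 0.3),
  Goblin = c(0.1, 0.3, 0.6)
)

prob_cols <- c("Ghost", "Ghoul", "Goblin")

# max.col() gives, for each row, the index of the column with the largest value.
# ties.method = "first" makes ties deterministic (the default breaks them at random).
best <- max.col(as.matrix(predProbs[, prob_cols]), ties.method = "first")

# Translate the indices back into column names and store them in the type column.
predProbs$type <- prob_cols[best]
```

Selecting the columns by name rather than by position (7:9) is a design choice here, so the code keeps working if columns are reordered.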

Conditionally create new column in R

I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The rows are ordered ascending according to the timestamp.
Now, I'd like to calculate the difference between each consecutive timestamp and if it exceeds a certain threshold the factor should be increased by 1 (see Desired Output).
Desired Output
I tried to solve it with a for loop; however, that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I can not make it work for my dataset.
Thanks for your help!
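One loop-free way to sketch this is diff() combined with cumsum(): flag every row whose gap to the previous row exceeds the threshold, then take a running count of the flags. The assumptions here are that the timestamps are numeric seconds, that 7200 (taken from the snippet above) is the threshold, and that the counter starts at 1; the desired output is not shown, so the starting value is a guess, and the group column name is made up.

```r
# Toy timestamps in seconds (hypothetical values), already sorted ascending.
df <- data.frame(timestamp = c(0, 100, 200, 8000, 8100, 20000))

threshold <- 7200  # from the snippet in the question; unit assumed to be seconds

# diff() gives the gap to the previous row; the first row has no predecessor,
# so it is treated as not exceeding the threshold.
new_group <- c(FALSE, diff(df$timestamp) > threshold)

# cumsum() turns each flagged jump into a counter increment; + 1 starts at 1.
df$group <- factor(cumsum(new_group) + 1)
```

If the column is POSIXct rather than numeric, applying as.numeric() to it first gives seconds and avoids depending on the units that diff() picks for the resulting difftime.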

Change all cells marked X into cell value shown in column P

I need an easy way to convert all X's in a column into the value shown in a cell.
Basically, we want to sell multiple products to a client, with a target order value split amongst the relevant products. I have used a COUNTA formula to show how many columns are not blank. Then I did a simple division to split the total value over the columns that are not blank (if there are 2 columns marked X, it would be 10,000 / 2, assuming the target value is 10k). Now I need to change all the X's into the figure shown in the cell, as shown in the pic.
I can't for the life of me think of an easy way of doing it, but surely there is one?
Screen shot of sheet
You need 2 sets of columns. Your first set has the X's and the blanks to mark which categories are applicable. The second set has the calculated values for the selected categories.
In the value columns, you can use a formula like the following for the first data row and first category:
=IF(E2 = "x", $D2, 0)
Assuming "D" is your "Dvided total" column, and "E" is your "Dairy & S..." column, and "2" is your first data row.

How to show 3 values on the X axis instead of 1?

I have a bar chart which shows a histogram of weight vs count. The x axis is the weight and the y axis is the count of how many weights fell within that weight range.
I would like to also display the percent of each weight range which would be the weight ranges count/totalcount and show the numeric value of the count next to each weight on the x axis.
So an example of the x axis followed by the y axis should look like this
weight percent count | count
1.000 3.013% 512 | 512
I was able to achieve this by combining the weight, percent and count into one string and then setting the data row of a datatable to the entire string instead of just the count. However, I was wondering if there is a way to accomplish this by placing each data value into its own column, so that when I export it to Excel instead of a chart, each value has its own column instead of one concatenated column of 3 values.
I could just check if export to excel is selected rather than chart and then dynamically build my data table to include a column for each value or one column with a concatenated value for each row depending on what they select. However I still believe there's a better option that I have not figured out.
