New category column based on column value - r

I am trying to use if and ifelse statements to create a new column in my data frame based on the values of an existing column. For example, I have a variable column which has numbers from 1 to 10000. I want to categorize them into 8 buckets (each of size 1250) in my new column. So if I have 1 in my column, I should get b1 in the new column. If I have 9999, I should get b8, etc. My if/else code is failing so far.
Thanks!

You can use cut:
paste0('b', cut(1:10000, 8, labels = FALSE))
Replace 1:10000 with your column values (df$column_name).
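
When cut is given a number of intervals, it derives the bucket edges from the observed range of the data, so the edges shift if the column does not actually span 1 to 10000. A minimal sketch with explicit edges, assuming a hypothetical data frame df with a column named value:

```r
# Hypothetical data frame; `value` stands in for your own column name.
df <- data.frame(value = c(1, 1250, 1251, 5000, 9999, 10000))

# Explicit edges give the intervals (0,1250], (1250,2500], ..., (8750,10000].
edges <- seq(0, 10000, by = 1250)

# cut() returns a factor; as.character() keeps the new column a plain string.
df$bucket <- as.character(cut(df$value, breaks = edges, labels = paste0("b", 1:8)))
df
```

With fixed edges, a value outside 1 to 10000 becomes NA instead of silently changing the bucket boundaries.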

Related

How do I subtract values from one column from another in a dataframe?

I have a dataframe where the rows are the names of different genes, with 2 columns called: Control_mean and Patient_mean.
I want to create a third column where I store the value of "Patient_mean - Control_mean" for each row respectively, but I can't figure out how!
I tried to do so using this:
for(i in 1:nrow(newdf8)){
newdf8$log2FC[i] <- (newdf8[,2] - newdf8[,1])
}
but it didn't work: all the values in the new column became the same number, not the actual difference for each row.
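
The loop assigns the whole difference vector to a single cell on each pass, so R keeps only its first element (with a warning), which is why every row ends up with the same number. Arithmetic on columns is already vectorized, so one possible fix is to drop the loop. A minimal sketch, using a hypothetical stand-in for newdf8:

```r
# Hypothetical stand-in for newdf8, with gene names as row names.
newdf8 <- data.frame(
  Control_mean = c(2.0, 5.5, 3.1),
  Patient_mean = c(3.5, 4.0, 3.1),
  row.names = c("geneA", "geneB", "geneC")
)

# Subtracting two columns works element by element, so no loop is needed.
newdf8$log2FC <- newdf8$Patient_mean - newdf8$Control_mean
newdf8
```

Referring to the columns by name rather than by position (newdf8[,2], newdf8[,1]) also keeps the code correct if the column order changes.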

Can I replace values in cells conditional on a string?

I am new to R and trying to create my own dataset by modifying Eurostat data. I now have regions with names such as AT111, AT112 and ITC11. I want to give each country a number, so that all regions from AT have a country code equal to 1.
For that I have added a new empty column to my dataset. Is there a way for me to do this:
NUTS3.3[NUTS3.3$geo == "AT111", "country"] <- 1
for all observations whose geo string contains "AT" at once?
I have >26 000 observations, so doing it for every single regional code would be tedious.
We can extract the first two characters of geo with substr and compare them to "AT" with ==:
NUTS3.3$country[substr(NUTS3.3$geo, 1, 2)=="AT"] <- 1
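
To number every country in one step rather than one prefix at a time, the two-letter prefix can be matched against the set of unique prefixes. This is a sketch on hypothetical sample codes; the numbering follows order of first appearance, which is an assumption about what numbering is wanted:

```r
# Hypothetical sample of region codes.
NUTS3.3 <- data.frame(geo = c("AT111", "AT112", "ITC11", "BE100", "ITC12"),
                      stringsAsFactors = FALSE)

# First two characters identify the country.
prefix <- substr(NUTS3.3$geo, 1, 2)

# match() returns the position of each prefix among the unique prefixes,
# so countries are numbered in order of first appearance.
NUTS3.3$country <- match(prefix, unique(prefix))
NUTS3.3
```

If a specific numbering is required, replace unique(prefix) with a vector of prefixes in the desired order.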

How to add new vector that includes 0 and cumsum to dataframe?

I want to add a new vector to a dataframe based on cumsum of previous column but starting from 0.
I've tried to create a vector with 0 and then the cumsum function but I have an additional row now from this. I've tried to remove the additional row but cannot.
mydata$time<-c(0,cumsum(mydata$duration))
Error in $<-.data.frame(*tmp*, time, value = c(0, 5, 9, 15.4, :
replacement has 1138 rows, data has 1137
Prepending 0 makes the vector one element longer than the data frame has rows. Drop the last element of duration before taking the cumulative sum so the lengths match:
mydata$time <- cumsum(c(0, mydata$duration[-nrow(mydata)]))
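
A small worked sketch of the same idea, using hypothetical durations, together with an equivalent form that subtracts each duration from the running total:

```r
# Hypothetical durations.
mydata <- data.frame(duration = c(5, 4, 6.4, 2))

# Start time of each row: cumulative sum of all earlier durations.
mydata$time <- cumsum(c(0, mydata$duration[-nrow(mydata)]))

# Equivalent: running total minus the row's own duration.
alt_time <- cumsum(mydata$duration) - mydata$duration

mydata
```

Both versions produce a vector with exactly nrow(mydata) elements, so the assignment no longer fails.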

Insert all missing rows into data table for a range of values for 2 columns

I am interested in inserting all missing rows into a data table for a new range of values for 2 columns.
For example, dt1[,a] has some values from 1 to 5, as does dt1[,b], but I'd like not only all pairwise combinations of the existing values to be present in columns a and b, but all combinations within a newly defined range, e.g. 1 to 7.
# Example data.table
library(data.table)
dt1 <- data.table(a = c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5),
                  b = c(1,3,4,5,1,2,3,4,1,2,3,1,2,3,4,5,3,4,5),
                  c = sample(1:10, 19, replace = TRUE))
setkey(dt1, a, b)
# CJ in data.table will create all rows to ensure all
# pairwise combinations are present (using the nominated columns).
dt1[CJ(a, b, unique = TRUE)]
The above is great but will only use the max and min in the nominated columns. I'd like the inserted rows to give me all combinations between a new, nominated range, e.g. 1 to 7. There would be 49 rows.
# the following is a temporary workaround
template <- data.table(a1=rep(1:7,each=7),b1=rep(1:7,7))
setkey(template,a1,b1)
full <- dt1[template]
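Instead of using the values already present in column 'a', we can pass an explicit range to CJ for 'a':
dt1[CJ(a = 1:7, b, unique = TRUE)]
This still takes 'b' from the existing values (1 to 5), giving 35 rows. To get all 49 combinations, pass a range for 'b' as well:
dt1[CJ(a = 1:7, b = 1:7)]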
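
To check how many rows a CJ join inserts, here is a self-contained sketch on a small hypothetical table dt (not the asker's dt1). It passes explicit ranges for both columns and names the join columns with on= instead of relying on the key:

```r
library(data.table)

# Hypothetical table with only three of the possible (a, b) pairs present.
dt <- data.table(a = c(1L, 2L, 5L), b = c(1L, 3L, 5L), c = c(10, 20, 30))

# CJ() builds every combination of the two ranges (7 x 7 = 49 rows);
# joining dt onto it leaves c as NA for combinations that were missing.
full <- dt[CJ(a = 1:7, b = 1:7), on = .(a, b)]

nrow(full)            # number of rows after inserting missing combinations
sum(!is.na(full$c))   # number of rows that came from the original table
```

The inserted rows carry NA in every non-join column, so they are easy to tell apart from the original data.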

R: returning the 5 rows with the highest values

Sample data
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))
I'm trying to automate the process of returning the rows in a data frame that contain the 5 highest values in a certain column. In the sample data, the 5 highest values in the "kWh" column can be found using the code:
(tail(sort(mysample$kWh), 5))
which in my case returns:
[1] 1.477391 1.765312 1.778396 2.686136 2.710494
I would like to create a table that contains rows that contain these numbers in column 2.
I am attempting to use this code:
mysample[mysample$kWh == (tail(sort(mysample$kWh), 5)),]
This returns:
ID kWh
87 87 1.765312
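I would like it to return the 5 rows that contain the figures above in the "kWh" column. I'm sure I've missed something basic but I can't figure it out.
The == comparison fails because the 5 values are recycled along the 100-row column and compared position by position, so a row matches only by coincidence. We can use rank instead: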
mysample$Rank <- rank(-mysample$kWh)
head(mysample[order(mysample$Rank),],5)
If we don't need to create a column, we can use order directly (as @Jaap mentioned, in three alternative ways):
# order descending and get the first 5 rows
head(mysample[order(-mysample$kWh),],5)
# order ascending and get the last 5 rows
tail(mysample[order(mysample$kWh),],5)
# or just use a sequence as the row index
mysample[order(-mysample$kWh),][1:5,]
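
The original attempt can also be repaired by testing set membership with %in% instead of ==. This is a sketch under the assumption that there are no ties among the top values; the seed is arbitrary and only makes the example reproducible:

```r
set.seed(42)  # arbitrary seed, only for reproducibility
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))

# The 5 largest kWh values.
top5 <- tail(sort(mysample$kWh), 5)

# %in% checks each row's value against the whole set, with no recycling.
res <- mysample[mysample$kWh %in% top5, ]
res
```

Unlike the order-based versions, this keeps the rows in their original order, and it can return more than 5 rows if the fifth-largest value is tied.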
