Replacing values from a column using a condition in R - r

I have a very basic R question but I am having a hard time trying to get the right answer. I have a data frame that looks like this:
species <- "ABC"
ind <- rep(1:4, each = 24)
hour <- rep(seq(0, 23, by = 1), 4)
depth <- runif(length(ind), 1, 50)
df <- data.frame(cbind(species, ind, hour, depth))
df$depth <- as.numeric(df$depth)
I would like to select all the rows where depth < 10 (for example) and replace those depth values with zero, while keeping all the other information associated with those rows and the original dimensions of the data frame.
I have tried the following, but it does not work:
df[df$depth<10] <- 0
Any suggestions?

# reassign depth values under 10 to zero
df$depth[df$depth<10] <- 0
For columns that are factors, you can only assign values that are existing factor levels. If you want to assign a value that is not currently a factor level, you need to create the additional level first:
levels(df$species) <- c(levels(df$species), "unknown")
df$species[df$depth<10] <- "unknown"
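As a minimal sketch of why the column-wise assignment works where the single-bracket attempt does not: with one index and no comma, a data frame is indexed by column, so the logical condition has to be applied to the depth column itself (or used as a row index together with a column name). The small data frame and its depth values below are made up for illustration.

```r
# Hypothetical data with the same columns as in the question
df <- data.frame(
  species = "ABC",
  ind     = rep(1:2, each = 3),
  hour    = rep(0:2, times = 2),
  depth   = c(3.2, 15.8, 9.9, 42.1, 10.0, 0.5)
)
dims_before <- dim(df)

# Option 1: index the column vector directly
df$depth[df$depth < 10] <- 0

# Option 2 (equivalent): row condition plus column name
# df[df$depth < 10, "depth"] <- 0

dims_after <- dim(df)  # unchanged: only values were replaced, no rows dropped
print(df)
```

Either form leaves the other columns untouched, which is what the question asks for.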

I arrived here from a Google search. Since the rest of my code is 'tidy', I am leaving the 'tidy' way here for anyone else who may find it useful:
library(dplyr)
iris %>%
  mutate(Species = ifelse(as.character(Species) == "virginica", "newValue", as.character(Species)))
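Applied to the numeric depth column from the question, the same 'tidy' idea could look like the sketch below. The data frame is a made-up stand-in for the asker's data, and if_else() is used here as a stricter alternative to ifelse(), which is an assumption about preference rather than a requirement.

```r
library(dplyr)

# Hypothetical data with the same columns as in the question
df <- data.frame(
  species = "ABC",
  ind     = rep(1:2, each = 3),
  hour    = rep(0:2, times = 2),
  depth   = c(3.2, 15.8, 9.9, 42.1, 10.0, 0.5)
)

# Replace depths below 10 with 0; all rows and other columns are kept
df_tidy <- df %>%
  mutate(depth = if_else(depth < 10, 0, depth))

print(df_tidy)
```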

Related

Transformation from long to wide with multiple observations in R

I want to transform a data set from long to wide.
The data contains multiple observations for each time point.
To illustrate, consider the following two examples.
In EXAMPLE 1 below, the data does not contain multiple observations and can be transformed from long to wide.
In EXAMPLE 2 below, the data does contain multiple observations (n=3 per time point) and cannot be transformed from long to wide, testing with dcast and pivot_wider.
Can anyone suggest a method to transform the test data from EXAMPLE 2 into a valid format?
Code to reproduce the problem:
library(ggplot2)
library(ggcorrplot)
library(reshape2)
library(tidyr)
library(data.table)
# EXAMPLE 1 (does work)
# Test data
set.seed(5)
time <- rep(c(0,10), 1, each = 2)
feature <- rep(c("feat1", "feat2"), 2)
values <- runif(4, min=0, max=1)
# Concatenate test data
# test has non-unique values in time column
test <- data.table(time, feature, values)
# Transform data into wide format
test_wide <- dcast(test, time ~ feature, value.var = 'values')
# EXAMPLE 2 (does not work)
# Test data
set.seed(5)
time <- rep(c(0,10), 2, each = 6)
feature <- c(rep("feat1", 12), rep("feat2", 12))
values <- runif(24, min=0, max=1)
# Concatenate test data
# test has non-unique values in time column
test <- data.table(time, feature, values)
# Transform data into wide format
test_wide <- dcast(test, time ~ feature, value.var = 'values')
Warning:
Aggregate function missing, defaulting to 'length'
Problem:
Non-unique values in first column (time) are not preserved/allowed.
# Testing with pivot_wider
test_wider <- pivot_wider(test, names_from = feature, values_from = values)
Warning:
Warning message:
Values are not uniquely identified; output will contain list-cols.
Problem:
Non-unique values in first column (time) are not preserved/allowed.
For lack of a better idea, a possible output could look like this:
time  feat1      feat2
0     0.1046501  0.5279600
0     0.7010575  0.8079352
0     0.2002145  0.9565001
etc.
Since there are multiple values, it is not obvious how these should be treated when converting to a wide format. That's why you get the warning messages. This is one way of handling them. If you want something else, then please give a specific example of what the output should look like.
pivot_wider(test, names_from = feature, values_from = values) %>%
  unnest(c(feat1, feat2))
You may want something like this:
library(dplyr)
test %>%
  pivot_wider(names_from = c(feature, time),
              values_from = values)
Here, c(feature, time) accounts for the case of multiple variables. As was already pointed out in the comments, please indicate if you want something else.
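One possible way to get the layout sketched in the question is to add an explicit observation index within each time/feature group, so that every row of the wide table is uniquely identified. This is only a sketch: the column name obs and the toy values are assumptions made for illustration, and it presumes that pairing the n-th feat1 value with the n-th feat2 value is meaningful for the data.

```r
library(dplyr)
library(tidyr)

# Toy data: 3 observations per time point and feature (values are made up)
test <- data.frame(
  time    = rep(c(0, 10), times = 2, each = 3),
  feature = rep(c("feat1", "feat2"), each = 6),
  values  = 1:12
)

wide <- test %>%
  group_by(time, feature) %>%
  mutate(obs = row_number()) %>%   # 1, 2, 3 within each time/feature group
  ungroup() %>%
  pivot_wider(names_from = feature, values_from = values)

print(wide)
```

The obs column can be dropped afterwards with select(-obs) if it is not needed.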

Adding a column to a data frame using mutate in R

I am working with the OJ data set in the ISLR package. I need to add two columns to the data frame. One column is the product of two numerical variables. The second column is the product of a numerical and a categorical variable.
I added the first column (numerical*numerical) using the mutate function from the dplyr package as follows:
require(ISLR)
require(dplyr)
OJ %>%
  mutate(`StoreID:PriceCH` = StoreID*PriceCH)
I was able to add this column successfully. But when I tried to do the same for the categorical*numerical column, I got the following warning:
OJ %>%
mutate(`Store7:PriceCH` = Store7*PriceCH)
Warning message:
In Ops.factor(Store7, PriceCH) : ‘*’ not meaningful for factors
Can anyone suggest what I can do if I need to add a column which is the product of categorical*numerical?
My output should be something like this,
Thank you
Apply one-hot encoding to Store7 first:
OJ <- cbind(OJ, sapply("Yes", function(x) as.integer(x == OJ$Store7)))
names(OJ)[ncol(OJ)] <- "Store7_Yes"
Conceptually, it does not make a lot of sense (in most cases) to multiply categorical variables.
Though if you want to do so, you have to transform your data to a numeric scale. Be aware that this is not always so straightforward.
This could be a starting point:
library(tidyverse)
Result <- OJ %>%
mutate(`StoreID:PriceCH` = StoreID*PriceCH) %>%
mutate(Store7Numeric = as.character(Store7)) #To avoid possible mistakes
Result <- Result %>%
mutate(Store7Numeric = ifelse(Store7Numeric == "No", 0, 1)) #Check this
Result <- Result %>% mutate(Store7Numeric = as.numeric(Store7Numeric)) %>% #To numeric
mutate(`Store7:PriceCH` = Store7Numeric*PriceCH) %>% #Your calculation
select(-Store7Numeric) #Remove, if you want. the created numeric variable
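The warning is due to the variable Store7 being a factor (see str(OJ)), so you must make it numeric. Note that as.numeric() on a factor returns the level codes (1 for "No", 2 for "Yes"), so subtract 1 to get a 0/1 indicator:
OJ$Store7 <- as.numeric(OJ$Store7) - 1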
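To make the factor-to-numeric step concrete, here is a small sketch on made-up data (the prices are hypothetical, not taken from the OJ data set). It shows what as.numeric() returns for a factor and one way to build a 0/1 indicator before multiplying; the helper column name Store7Yes is an assumption for illustration.

```r
# Hypothetical stand-in for the two relevant OJ columns
oj <- data.frame(
  Store7  = factor(c("No", "Yes", "Yes", "No")),
  PriceCH = c(1.75, 1.86, 1.99, 1.69)
)

# as.numeric() on a factor gives the internal level codes, not 0/1
codes <- as.numeric(oj$Store7)

# Comparing against a level gives an explicit 0/1 indicator
oj$Store7Yes <- as.integer(oj$Store7 == "Yes")

# Product of the indicator and the numeric column
oj$`Store7:PriceCH` <- oj$Store7Yes * oj$PriceCH

print(oj)
```

The product is zero for rows where Store7 is "No" and equal to PriceCH otherwise, which is how such an interaction term is usually interpreted.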

How to use if-statement in apply function?

Since I have to read over 3 GB of data, I would like to improve my code by replacing two for-loops and an if-statement with the apply function.
Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the "c" column for each combination of a and b (x and y in the loop below). In the real case I have over 150 files to read.
# Example of initial data set
df1 <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15))
# Another dataframe to keep track of "c" counts
dfOcc <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),"positive"=c(0),"negative"=c(0))
So far I have written this code, which works but is really slow:
for (i in 1:nrow(df1)) {
  x <- df1[i, "a"]
  y <- df1[i, "b"]
  # which() needs the element-wise `&`; `&&` only compares single values
  idx <- which(dfOcc$a == x & dfOcc$b == y)
  if (df1[i, "c"] >= 0) {
    dfOcc[idx, "positive"] <- dfOcc[idx, "positive"] + 1
  } else {
    dfOcc[idx, "negative"] <- dfOcc[idx, "negative"] + 1
  }
}
I am unsure whether the code is slow due to the size of the files (260k rows each) or due to the for-loop?
So far I have managed to improve it in this way:
dfOcc[which(dfOcc$a==df1$a & dfOcc$b==df1$b),"positive"] <- apply(df1,1,function(x){ifelse(x["c"]>0,1,0)})
This works fine in this example but not in my real case:
It only keeps count of the positive c values, and running this code twice might be counterproductive.
My original datasets have 260k rows while my "tracer" has 10k rows (the initial dataset repeats the a and b values with other c values).
Any tip on how to improve those two points would be greatly appreciated!
I think you can simply count and spread the data. This is easier and works for any grouping and dataset. You can change group_by(a) to group_by(a, b) if you want to count by both the a and b columns.
library(dplyr)
library(tidyr)
df1 %>%
group_by(a) %>%
mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
count(sign) %>%
spread(sign, n)
The data.table package might help you do this in one line.
library(data.table)
df1 <- data.table(data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15)))
posneg <- c("positive" , "negative") # list of columns needed
df1[,(posneg) := list(ifelse(c>0, 1,0), ifelse(c<0, 1,0))] # use list to combine the 2 ifelse conditions
For more information, try
?data.table
If you really want the positive and negative counts to be in a separate data frame:
dfOcc <- df1[,c("a", "positive","negative")]
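For comparison, a base R sketch of the same counting task without an explicit loop: build 0/1 indicator columns in one vectorized step and sum them per combination of a and b with aggregate(). The toy values of c below are made up so that each group contains more than one row; zero is counted as positive, following the >= 0 test in the question's loop.

```r
# Toy data: each (a, b) combination appears twice (values are made up)
df1 <- data.frame(
  a = rep(1:2, times = 4),
  b = rep(1:2, each = 4),
  c = c(0.5, -1, 2, -0.3, -2, 1, 0, -4)
)

# Vectorized indicators instead of a row-by-row if/else
df1$positive <- as.integer(df1$c >= 0)
df1$negative <- as.integer(df1$c < 0)

# Sum the indicators within each (a, b) group
dfOcc <- aggregate(cbind(positive, negative) ~ a + b, data = df1, FUN = sum)

print(dfOcc)
```

Because the grouping is done on the whole data at once, the same call can be applied to each file and the per-file results added together afterwards.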

Categorize large factor into small factors based on frequency with remaining entries as 'Others'

I have a large factor (df$name) with more than 1000 levels. What I need is the top 10-15 levels by frequency, with the remaining levels grouped together as 'Others'.
I tried using the following command but wasn't successful:
df$name <- levels(df$name)[which(table(df$name)<1000000)] <- "Others"
PS: I'm using a frequency count since I don't want to restrict myself with a specific count of factors here. I'm happy if I get anywhere from 5-20 top factors (by frequency) and the rest of them combined together as 'Others' for easy visualization.
First, I would count the name frequencies using table() and top_n() to pick out the top 15 (or 10) names in your data set (I stored them in the top_15_names object). After that, I created a name_category column to show the groups of names using mutate(). Here is how I would do it:
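library(dplyr)
df$name <- as.factor(df$name)
# Frequency table, sorted, keeping the 15 most frequent names
top_15 <- data.frame(table(df$name)) %>%
  arrange(desc(Freq)) %>%
  top_n(15, Freq)
top_15_names <- top_15$Var1
# Label every row as belonging to a top-15 name or not
dat <- df %>%
  mutate(name_category = case_when(
    name %in% top_15_names ~ "Top15",
    TRUE ~ "Others"
  ))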
I hope you find this helpful.
Here's a column in a data frame with 2000 levels:
df <- data.frame(names = sample(1:2000, 1E6, replace = TRUE))
df$names <- as.factor(df$names)
And here a new variable is added which keeps the top 15 and puts the rest in "Other."
df$names_lump = forcats::fct_lump(df$names, n = 15)
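Since the question prefers a frequency cut-off over a fixed number of levels, here is a base R sketch of that variant. It relies on the fact that assigning the same label to several factor levels merges them. The toy factor and the threshold min_count are assumptions for illustration.

```r
# Toy factor: levels a-e with known frequencies (made up)
name <- factor(rep(c("a", "b", "c", "d", "e"), times = c(50, 30, 10, 5, 5)))

min_count <- 10  # hypothetical cut-off: rarer levels are lumped together

lumped <- name
# table() is ordered like levels(), so this logical vector lines up with them;
# renaming all rare levels to the same label merges them into one level
rare <- as.vector(table(name) < min_count)
levels(lumped)[rare] <- "Others"

print(table(lumped))
```

The original factor is left untouched, so the cut-off can be adjusted and the step repeated until the number of remaining levels is convenient for plotting.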

Aggregate variables separately in R [lapply + aggregate]

I have a data.frame with a set of records, with different measurements as variables. I would like to create a new data.frame containing the number of records that have a specific measurement value, for each measurement. Basically what I am trying to do is:
record <- c("r1", "r2", "r3")
firstMeasurement <- c(15, 10, 10)
secondMeasurement <- c(2, 4, 2)
df <- data.frame(record, firstMeasurement, secondMeasurement)
measurements <- c(colnames(df[2:3]))
measuramentsAggregate <- lapply(measurements, function(i)
aggregate(record~i, df, FUN=length))
I am getting really funny errors and I don't understand why. Can anyone help me?
Many thanks!
I think this is what you want
library(dplyr)
agg.measurements <- df %>% group_by(firstMeasurement) %>% summarise(records=n())
That should do it for the first one.
If you want the number of records with specific firstMeasurement:
table(df$firstMeasurement)
Likewise for the secondMeasurement. I am not sure how the data.frame you are trying to create might look.
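A likely cause of the errors in the question is that inside the formula record ~ i, the symbol i is taken literally rather than replaced by the column name it holds. A sketch of one way around this, keeping the lapply + aggregate structure, is to build the formula from the string with reformulate(). The result name counts is an assumption for illustration.

```r
record <- c("r1", "r2", "r3")
firstMeasurement <- c(15, 10, 10)
secondMeasurement <- c(2, 4, 2)
df <- data.frame(record, firstMeasurement, secondMeasurement)

measurements <- colnames(df)[2:3]

# reformulate("firstMeasurement", response = "record") builds
# the formula record ~ firstMeasurement from the column name
counts <- lapply(measurements, function(m) {
  aggregate(reformulate(m, response = "record"), data = df, FUN = length)
})
names(counts) <- measurements

print(counts)
```

Each list element is a small data frame with one row per distinct measurement value and the number of records having that value.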
