I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code format for different columns that I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
replace = c("rural" = 1,
"suburban" = 2,
"urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
value1, num_val1,
value2, num_val2,
value3, num_val3) {
# plyr using revalue
data$columnX <- revalue(x = data$columnX,
replace = c(value1 = num_val1,
value2 = num_val2,
value3 = num_val3))
# set as numeric
data$columnX <- as.numeric(data$columnX)
# return dataset
return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've tried it with a single value1 where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3)
I know I can just convert the data with:
df$Area <- as.numeric(as.factor(df$Area))
but I want specific values for each choice rather than letting R choose.
Any assistance appreciated.
As already mentioned by @MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result is dplyr::recode. Additionally, if you want to write your own function, I would suggest passing the values to recode and their replacements as two vectors instead of passing each value and replacement as a separate argument:
library(dplyr)
set.seed(42)
df <- data.frame(
Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
recode_table <- setNames(replacements, values)
data[[x]] <- recode(data[[x]], !!!recode_table)
data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
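For comparison, here is a minimal base R sketch of the same idea without dplyr. It also shows why the original attempt failed: data$columnX looks for a column literally named "columnX", whereas data[[column]] uses the name stored in the variable. The helper name recode_column and the toy data are assumptions made for illustration.

```r
# Hypothetical helper (not from the original answer): recode one column
# through a named lookup vector, using base R only.
recode_column <- function(data, column, lookup) {
  # `[[` accepts a column name held in a variable; `$` would not.
  data[[column]] <- unname(lookup[as.character(data[[column]])])
  data
}

df <- data.frame(Area = c("rural", "urban", "suburban", "rural"),
                 stringsAsFactors = FALSE)
df <- recode_column(df, "Area", c(rural = 1, suburban = 2, urban = 3))
df$Area
#> [1] 1 3 2 1
```

Values that are missing from the lookup vector come back as NA, which makes unexpected categories easy to spot.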
You can use case_when with across.
If the columns that you want to change are called col1, col2 you can do -
library(dplyr)
df <- df %>%
mutate(across(c(col1, col2), ~case_when(. == 'rural' ~ 1,
. == 'suburban' ~ 2,
. == 'urban' ~ 3)))
Based on your actual column names you can also pass starts_with, ends_with, range of columns A:Z in across.
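As a small sketch of the starts_with variant, assuming the columns to recode share a hypothetical "area_" prefix (the column names and data below are made up for illustration):

```r
library(dplyr)

# Toy data: two columns to recode plus an id column that must stay untouched.
df <- data.frame(area_home = c("rural", "urban"),
                 area_work = c("suburban", "rural"),
                 id = 1:2)

recoded <- df %>%
  mutate(across(starts_with("area_"),
                ~ case_when(.x == "rural" ~ 1,
                            .x == "suburban" ~ 2,
                            .x == "urban" ~ 3)))
recoded
```

Only the columns matched by starts_with("area_") are rewritten; id is left as it was.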
I want to create a dataframe with a column whose value depends on another object's value.
Here's an example, I want my column to be called "conditional_colname":
x = "conditional_colname"
df <- data.frame(x = c(1, 2, 3))
df
> x
1 1
2 2
3 3
I could try the following indirection syntax in tidy evaluation, but it returns an error:
data.frame({{x}} := c(1, 2, 3))
> Error in `:=`({ : could not find function ":="
I can sort out the problem through the use of the rename function and indirection in tidy evaluation syntax, as in:
df %>% rename({{x}} := x)
> conditional_colname
1 1
2 2
3 3
but that involves creating the dataframe with a wrong name and then renaming it, is there any option to do it from the creation of the dataset?
{{..}} can be used with tibbles -
library(tibble)
library(rlang)
df <- tibble({{x}} := c(1, 2, 3))
df
# A tibble: 3 × 1
# conditional_colname
# <dbl>
#1 1
#2 2
#3 3
A solution with data.frame would be with setNames.
df <- setNames(data.frame(c(1, 2, 3)), x)
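If several columns are needed, one possible variation is to name a list first and then hand it to data.frame. This is only a sketch; the second column name "label" is an assumption added for illustration.

```r
x <- "conditional_colname"

# Name the list elements first, then let data.frame() turn them into columns.
columns <- setNames(list(c(1, 2, 3), c("a", "b", "c")), c(x, "label"))
df <- data.frame(columns)
names(df)
#> [1] "conditional_colname" "label"
```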
I have two dataframes df1 and df2 which I have merged together into another dataframe df3
df1 <- data.frame(
Name = c("A", "B", "C"),
Value = c(1, 2, 3),
Method = c("Indirect"))
df2 <- data.frame(
Name = c("A", "B"),
Value = c(4, 5),
Method = c("Direct"))
df3 <- rbind(df1, df2)
So df3 looks something like this
Now I need to identify all the unique entries in the Name column (which is C in this case) and for each of the unique entries, a row is to be added which would have the same "Name" but "Value" would be 0 and the "Method" would be the opposite one. The output should look like this.
Finally the rows with similar "Name" are to be arranged one below the other.
I have a huge dataframe and I need to achieve the above mentioned outcome in the most efficient way in R. How do I proceed?
One way
tmp=df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]),]
tmp$Value=0
tmp$Method=ifelse(tmp$Method=="Direct","Indirect","Direct")
Name Value Method
3 C 0 Direct
You can now rbind this to your original data (and sort it).
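Put together, the whole approach might look like the sketch below. It rebuilds df3 from the question so that it runs on its own; the names tmp and out are just working names.

```r
# Data from the question
df1 <- data.frame(Name = c("A", "B", "C"), Value = c(1, 2, 3), Method = "Indirect")
df2 <- data.frame(Name = c("A", "B"), Value = c(4, 5), Method = "Direct")
df3 <- rbind(df1, df2)

# Rows whose Name occurs exactly once
tmp <- df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]), ]
tmp$Value <- 0
tmp$Method <- ifelse(tmp$Method == "Direct", "Indirect", "Direct")

# Append the new rows and put equal names next to each other
out <- rbind(df3, tmp)
out <- out[order(out$Name), ]
rownames(out) <- NULL
out
```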
Please find another solution using data.table
Reprex
Code
library(data.table)
library(magrittr) # for the pipe!
setDT(df3)
df3 <- rbindlist(list(df3,
df3[!(df3$Name %in% df3[duplicated(Name)]$Name)
][, `:=` (Value = 0, Method = fifelse(Method == "Indirect", "Direct", "Indirect"))])) %>%
setorder(., Name)
Output
df3
#> Name Value Method
#> 1: A 1 Indirect
#> 2: A 4 Direct
#> 3: B 2 Indirect
#> 4: B 5 Direct
#> 5: C 3 Indirect
#> 6: C 0 Direct
Created on 2021-12-15 by the reprex package (v2.0.1)
I think that with 10,000 rows you will barely notice it:
library(dplyr)
df3 |>
add_count(Name) |>
filter(n == 1) |>
mutate(
Value = 0,
Method = c(Indirect = 'Direct', Direct = 'Indirect')[Method],
n = NULL
) |>
bind_rows(df3) |>
arrange(Name, Value, Method)
# Name Value Method
# 1 A 1 Indirect
# 2 A 4 Direct
# 3 B 2 Indirect
# 4 B 5 Direct
# 5 C 0 Direct
# 6 C 3 Indirect
I have to apologize in advance if the question is very basic as I am still new to R. I have tried to look on stackoverflow for similar questions, but I still can't resolve the problem that I am facing.
I am currently working on a large dataset X. What I am trying to do is pretty simple. I want to replace all NAs in selected columns (non consecutive columns) with "no".
I first created a variable containing all the columns that I want to modify. For instance, to modify the NAs in the columns named "m", "l" and "h", I wrote the following:
modify <- c("m","l","h")
for (i in 1:length(modify))
column <- modify[i]
X$column <- as.character(X$column) #X is my dataframe
X$column %>% replace_na("no")
This loop returned output only for the "m" column, which is the first entry in my modify variable. However, even after the loop printed that output, when I checked X$m, nothing had changed in my original dataset.
I also tried to create a function, which is very similar to the loop. Even though no error message was generated, it didn't work, as I do not know what the return value should be.
Why can't the loop be applied to my entire dataset when the individual steps in the loop work?
Thank you so so much for your help!
This might help. It was among the answers here, but is slightly different in that it uses all_of():
library(tidyverse)
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 <NA>
#> 3 NA b
modify <- c("x","y")
df %>%
mutate(
across(all_of(modify), ~replace_na(.x, 0))
)
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 0
#> 3 0 b
Created on 2021-09-22 by the reprex package (v2.0.1)
Here's a base R approach modifying data from @scrameri.
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"), c = c(1, NA, 5))
modify <- c('x', 'y')
df[modify][is.na(df[modify])] <- 'no'
df
# x y c
#1 1 a 1
#2 2 no NA
#3 no b 5
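Since the question also asks what a function version should return, here is a minimal sketch of one. The name fill_na_with and its arguments are assumptions for illustration. The key point is that the function returns the modified data frame and the caller assigns the result back.

```r
# Hypothetical helper: replace NA with a fill value in the chosen columns.
fill_na_with <- function(data, columns, fill = "no") {
  for (column in columns) {
    values <- as.character(data[[column]])  # the fill is text, so convert first
    values[is.na(values)] <- fill
    data[[column]] <- values
  }
  data  # return the modified copy; the original object is not changed
}

X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"), h = c(1, NA, 5))
X <- fill_na_with(X, c("m", "l"))
X
```

Column h is not listed, so it keeps its NA and stays numeric.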
I'm going to fix your code with as few changes as possible, so you can learn.
There are two big problems. First, the for loop needs to have curly braces {} around the lines you want to loop over. Second, if you want to reference variables in a data frame dynamically, you can't use the $ operator. You have to use double brackets [[]].
library(tidyr)
X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"), h = c(1, NA, 5))
modify <- c("m","l","h")
for (i in seq_along(modify)) {
column <- modify[i]
X[[column]] <- as.character(X[[column]]) #X is my dataframe
X[[column]] <- X[[column]] %>% replace_na("no")
}
X
# m l h
# 1 1 a 1
# 2 2 no no
# 3 no b 5
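To see the difference between $ and [[ in isolation, here is a tiny sketch with made-up data:

```r
X <- data.frame(m = c(1, NA), l = c("a", NA))
column <- "m"

# `$` takes the name literally, so this looks for a column called "column".
is.null(X$column)
#> [1] TRUE

# `[[` evaluates the variable, so this returns the "m" column.
X[[column]]
#> [1]  1 NA
```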
You can do what you were trying to do much more efficiently, as shown in the other answers. But I wanted to show you how to do it the way you were attempting, to correct your understanding of for loops and the subset operator. These are basic things that everyone should understand when first learning R.
You might want to go through a beginners tutorial to solidify your understanding. I used tutorialspoint when I was first learning and found it useful.
We could do this efficiently with set from data.table
library(data.table)
setDT(X)
for(nm in modify) {
set(X, i = NULL, j= nm, value = as.character(X[[nm]]))
set(X, i = which(is.na(X[[nm]])), j = nm, value = 'no')
}
Output
> X
m l h i
1: 1 a 1 NA
2: 2 no no 5
3: no b 5 6
data
X <- data.frame(m = c(1, 2, NA), l = c("a", NA, "b"),
h = c(1, NA, 5), i = c(NA, 5, 6))
modify <- c("m","l","h")
I am trying to get away from loops in R and was looking to both vectorize and speed up a section of my code.
I am looking to convert a For loop using lapply, but am getting an error:
Reproducible example:
library(dplyr)
# This works using a For loop -----------------------------------
# create sample data frame
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
diff <- numeric() # initialize
# Loop through each item and take difference of latest value from earlier values
for (myitem in unique(df$Item)) {
y = df[df$Date == last(df$Date) & df$Item == myitem, "Value"] # Latest value for an item
x = df[df$Item == myitem, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
df_final <- mutate(df, Difference = diff)
df_final
I found related questions here (lapply), here (lapply), and here ($ operator) but none really helped me with my question.
Here is how I tried to vectorize using lapply:
# Same thing using vectorized approach ----------------------------------
mylist <- list(unique(df$Item))
myfunction <- function(df = df, diff = numeric()) {
y = df[df$Date == last(df$Date) & df$Item == mylist, "Value"] # Latest value for an item
x = df[df$Item == mylist, "Value"] # Every value for an item
diff <- c(diff, y-x)
}
# throws error
diff_vector <- unlist(lapply(mylist, myfunction))
df_final2 <- mutate(df, Difference = diff_vector)
df_final2
My real data set has hundreds of thousands of rows. If someone could point me in the right direction on how to vectorize this to get the same output as the For loop I would appreciate it.
Thanks!
So lapply isn't being used quite right here, that's all!
lapply applies a function to each element of a list. To be explicit, it takes each element of a list, and applies the function to that element.
So if you want it to apply a function to several subsets of a data frame, you need to give it a list of those subsets. So let's create that list first.
We can do this using the split function, it splits your data frame into several data frames based on a column and stores these as a list. A list of subsets of a data frame. Perfect!
So let's replace the line where you create mylist with this line instead.
mylist <- split(df,df[,c("Item")])
Now we just need to make some changes to myfunction. Remember we're now passing through our data already subsetted, so we can remove the conditions about the Item matching with what we'd expect. Remember this function will get applied to each of these data frames in their entirety.
myfunction <- function(df) {
y <- df[df$Date == last(df$Date), "Value"] # Latest value for an item
x <- df[, "Value"] # Every value for an item
y - x # Difference of the latest value from every value
}
And the rest my friend, is exactly as you have it :)
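For reference, a self-contained sketch of the split and lapply steps is below. It swaps dplyr's last for base R's tail so that it runs without extra packages, and it uses unsplit to put the results back in the original row order; both choices are assumptions rather than part of the answer above.

```r
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item = rep(c("A", "B", "C"), each = 3),
                 Value = 10:18)

# One data frame per item
pieces <- split(df, df$Item)

# Latest value minus every value, computed within each piece
diff_by_item <- lapply(pieces, function(piece) {
  latest <- piece$Value[piece$Date == tail(piece$Date, 1)]
  latest - piece$Value
})

# unsplit() reverses split(), so the rows line up even if df is not sorted by Item
df$Difference <- unsplit(diff_by_item, df$Item)
df
```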
I'm not sure lapply is the right approach to take. I'd stick with mutate - which you already seem to be using:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
df <- df %>%
group_by(Item) %>%
mutate(diff = last(Value) - Value)
df
#> # A tibble: 9 x 4
#> # Groups: Item [3]
#> Date Item Value diff
#> <fct> <fct> <int> <int>
#> 1 Jan1 A 10 2
#> 2 Jan2 A 11 1
#> 3 Jan3 A 12 0
#> 4 Jan1 B 13 2
#> 5 Jan2 B 14 1
#> 6 Jan3 B 15 0
#> 7 Jan1 C 16 2
#> 8 Jan2 C 17 1
#> 9 Jan3 C 18 0
Created on 2018-06-27 by the reprex package (v0.2.0).
This does assume that the observations (at least within the "Item" group) are arranged in order. If not, add arrange(Date) %>% as a step after group_by.
You could create a table with the latest value, join it with the original table and take the difference, or use data.table to create an additional column with the latest value:
library(data.table)
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
Item = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Value = 10:18)
setDT(df)
df[,latestVal:=last(Value),by=.(Item)][,diff:=latestVal-Value][,.(Date,Item,Value,diff)]
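The join variant mentioned above could be sketched in base R roughly as follows. It assumes that the last row of each item holds its latest value; the names latest and joined are working names chosen for illustration.

```r
df <- data.frame(Date = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item = rep(c("A", "B", "C"), each = 3),
                 Value = 10:18)

# Keep the last row of each item as the lookup table of latest values
latest <- df[!duplicated(df$Item, fromLast = TRUE), c("Item", "Value")]
names(latest)[2] <- "latestVal"

# Join the latest value onto every row and take the difference
joined <- merge(df, latest, by = "Item")
joined$diff <- joined$latestVal - joined$Value
joined[, c("Date", "Item", "Value", "diff")]
```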
In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)
I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.
A combination of rep and data.frame can be used as an approach if another variable is needed for classification
Eg.
with(data.frame(v1 = rep(data$v1, data$count), v2 = rep(data$v2, data$count)),
boxplot(v1 ~ v2)
)
Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = T),
Count = sample(1:10, 100, replace = T),
Group = sample(c("A", "B", "C"), 100, replace = T),
stringsAsFactors = F)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group){
data.frame(x = rep(Value, Count),
y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)
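A possible base R alternative, sketched here with a smaller made-up data frame, is to repeat the row indices instead of building one data frame per row. This avoids the repeated rbind calls, which may matter on larger data.

```r
df <- data.frame(Value = c(10, 20, 30),
                 Count = c(2, 1, 3),
                 Group = c("A", "B", "A"))

# Repeat each row index Count times, then select those rows in one step
expanded <- df[rep(seq_len(nrow(df)), times = df$Count), c("Value", "Group")]

boxplot(Value ~ Group, data = expanded)
```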