I'm looking for mutate_if() from R's dplyr. For example, how could I select the Int64 columns and convert them to Float?
using DataFrames
df = DataFrame(A = [72, 38, 54],
B = [1, 2, 3],
C = ["red", "blue", "green"])
# convert integer columns to decimal columns without selecting them by name
df
Here it is. The code converts every column with an integer element type (such as Int64 or Int32) to Float64.
for col in findall(x -> x <: Integer, eltype.(eachcol(df)))  # eltypes(df) was removed from recent DataFrames.jl releases
df[!, col] = Float64.(df[!, col])
end
I have a data frame df with a column X that holds three different values a, b and c as characters. For example
df <- data.frame(X = c("a","a","a","b","b","c","c","c","c"), Y = ....)
I want to transform it into a = 1, b = 2 and c = 3 as numerics.
I first tried
df$X = as.factor(df$X)
transform(df, X = as.numeric(X))
where now I have a factor with three levels and a=1, b=2 and c=3. However the problem is that I need the column X as numeric. If I try
transform(df, X = as.numeric(as.character(X)))
or
transform(df, X = as.numeric(levels(X))[X])
I get NA for all the inputs (a, b, c).
How can I get the column X with numeric 1, 2, 3?
The solution by @jay.sf, which first encodes the characters as a factor, is quite elegant because it generalizes to arbitrary strings and not just single characters.
If the codes are single characters, there is another possible solution, which uses the builtin constant letters and returns the position therein:
sapply(df$X, function(x) {which(x == letters)})
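Both routes can be compared side by side. This is only a minimal sketch on an assumed example data frame: the factor route relies on the levels being sorted alphabetically, and match() is used here as a vectorised stand-in for the sapply() call above.

```r
# Assumed example data, mirroring the question
df <- data.frame(X = c("a", "a", "a", "b", "b", "c", "c", "c", "c"))

# Factor route: levels are sorted, so "a" -> 1, "b" -> 2, "c" -> 3
codes_factor <- as.integer(factor(df$X))

# Letter-position route: index of each value in the built-in `letters`
codes_letters <- match(df$X, letters)

# Assign the result back so the column really becomes numeric
df$X <- as.numeric(codes_factor)
print(df$X)
```

Note that transform() returns a modified copy, so its result also has to be assigned back to df for the change to stick.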
I have a dataframe in R. Either the df has two columns with values, or it has the dimensions 0,0.
If the dataframe has columns with values, I want to keep these values, but if the dimensions are 0,0, I want to create two columns with one row containing 0,0 values.
Either it looks like this:
start = c (2, 10, 20)
end = c(5, 15, 25)
dataframe = as.data.frame (cbind (start, end))
-> if the df looks like this, it should be retained
Or like this:
start = c ()
end = c()
dataframe = as.data.frame (cbind (start, end))
-> if the df looks like this, a row with (0,0) should be added in the first row.
I tried ifelse
dataframe_new = ifelse (nrow(dataframe) == 0, cbind (start = 0,end =0) , dataframe)
But if the dataframe is not empty, only the value in the first row and column remains. If the dataframe is empty, the result is a single 0.
Instead of the function ifelse, you should be using if...else clauses here. ifelse is used when you want to create a vector whose elements vary conditionally according to a logical vector with the same length as the output vector. That's not what you have here, but rather a simple branching condition (if the data frame has zero rows do one thing, if not then do something else). This is when you use if(condition) {do something} else {do another thing}
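The difference can be seen in a small sketch. The vector x below is just an assumed stand-in for your data: ifelse() shapes its result like the test, which has length 1 here, whereas if ... else returns the chosen object whole.

```r
x <- c(2, 10, 20)  # assumed stand-in for the non-empty data

# ifelse() is vectorised over its test: a length-1 test gives a length-1 result
ifelse_result <- ifelse(length(x) > 0, x, 0)
print(ifelse_result)  # 2, only the first element of x survives

# if ... else branches once and returns the selected object unchanged
if_result <- if (length(x) > 0) x else 0
print(if_result)      # 2 10 20
```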
If you need to use the code a lot, you could save yourself time by putting it in a simple function:
fill_empty <- function(x) {
if(nrow(x) == 0) data.frame(start = 0, end = 0) else x
}
Let's test this on your examples:
start = c (2, 10, 20)
end = c(5, 15, 25)
dataframe = as.data.frame (cbind (start, end))
fill_empty(dataframe)
#> start end
#> 1 2 5
#> 2 10 15
#> 3 20 25
And
start = c ()
end = c()
dataframe = as.data.frame (cbind (start, end))
fill_empty(dataframe)
#> start end
#> 1 0 0
Created on 2022-09-19 with reprex v2.0.2
I wish to give names to the values in a vector. I know how to do that, but in this case I have many names and many values, both within vectors within lists, and typing them by hand would be suicide.
This method:
> values <- c('jessica' = 1, 'jones' = 2)
> values
jessica jones
1 2
obviously works. However, this method:
> names <- c('jessica', 'jones')
> values <- c(names[1] = 1, names[2] = 2)
Error: unexpected '=' in "values <- c(names[1] ="
Well... I cannot understand why R refuses to read these as pure characters to assign them as names.
I realize I can create values and names separately and then assign names as names(values) but again, my actual case is far more complex. But really I would just like to know why this particular issue occurs.
EDIT I: The ACTUAL data I have is a list of vectors, each is a different combination of amounts of ingredients, and then a giant vector of ingredient names. I cannot just set the name vector as names, because the individual names need to be placed by hand.
EDIT II: Example of my data structure.
ingredients <- c('ing1', 'ing2', 'ing3', 'ing4') # this vector is much longer in reality
amounts <- list(c('ing1' = 1, 'ing2' = 2, 'ing4' = 3),
c('ing2' = 2, 'ing3' = 3),
c('ing1' = 12, 'ing2' = 4, 'ing3' = 3),
c('ing1' = 1, 'ing2' = 1, 'ing3' = 2, 'ing4' = 5))
# this list too is much longer
I could type each numeric value's name individually as presented, but there are many more, and so I tried instead to input the likes of:
c(ingredients[1] = 1, ingredients[2] = 2, ingredients[4] = 3)
But this throws an error:
Error: unexpected '=' in "amounts <- list(c(ingredients[1] ="
We can use setNames
setNames(1:2, names)
Another option is deframe if we have a two column dataset
library(tibble)
tibble(names, val = 1:2) %>%
deframe
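Applied to the ingredients example, setNames can take the names by position from the long name vector. The sketch below is only one possible arrangement: the index vectors in idx and the values in vals are made up for illustration, and Map() is used to pair them up.

```r
ingredients <- c("ing1", "ing2", "ing3", "ing4")

# Names come from positions 1, 2 and 4 of `ingredients`
first_recipe <- setNames(c(1, 2, 3), ingredients[c(1, 2, 4)])
print(first_recipe)

# Same idea for a whole list: hypothetical index vectors and amounts
idx  <- list(c(1, 2, 4), c(2, 3))
vals <- list(c(1, 2, 3), c(2, 3))
amounts <- Map(function(v, i) setNames(v, ingredients[i]), vals, idx)
print(amounts)
```

As for the error itself: inside a call such as c(), the left-hand side of = must be a plain name or a string, so the parser rejects an expression like names[1] before any code runs.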
I want to create multiple lag variables for a column in a data frame for a range of values. I have code below that successfully does what I want, but it is not scalable to what I need (hundreds of iterations).
Lake_Lag <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
group_by(StationID,Test) %>%
arrange(StationID,Test,VisitDate) %>%
mutate(lag.Result1 = dplyr::lag(Result, n = 1, default = NA))%>%
mutate(lag.Result5 = dplyr::lag(Result, n = 5, default = NA))%>%
mutate(lag.Result10 = dplyr::lag(Result, n = 10, default = NA))%>%
mutate(lag.Result15 = dplyr::lag(Result, n = 15, default = NA))%>%
mutate(lag.Result20 = dplyr::lag(Result, n = 20, default = NA))
I would like to be able to use a list c(1,5,10,15,20) or a range 1:150 to create lagging variables for my data frame.
Here's an approach that makes use of some 'tidy eval helpers' included in dplyr that come from the rlang package.
The basic idea is to create a new column in mutate() whose name is based on a string supplied by a for-loop.
library(dplyr)
grouped_data <- Lake_Champlain_long.term_monitoring_1992_2016 %>%
group_by(StationID,Test) %>%
arrange(StationID,Test,VisitDate)
for (lag_size in c(1, 5, 10, 15, 20)) {
new_col_name <- paste0("lag_result_", lag_size)
grouped_data <- grouped_data %>%
mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}
The !!sym(new_col_name) := construct is a dynamic way of writing lag_result_1 =, lag_result_5 =, etc. when using functions like mutate() or summarize() from the dplyr package.
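To check the pattern on something small, here is a sketch with made-up data. The toy data frame and its values are assumptions standing in for the monitoring table; only StationID and Result are kept.

```r
library(dplyr)

# Hypothetical toy data standing in for the monitoring table
toy <- data.frame(
  StationID = rep(c("s1", "s2"), each = 4),
  Result    = c(1, 2, 3, 4, 10, 20, 30, 40)
)

lagged <- toy %>% group_by(StationID)
for (lag_size in c(1, 2)) {
  new_col_name <- paste0("lag_result_", lag_size)
  # The column name is built from the string, the lag is computed per group
  lagged <- lagged %>%
    mutate(!!sym(new_col_name) := lag(Result, n = lag_size, default = NA))
}
lagged <- ungroup(lagged)
print(lagged)
```

Each pass through the loop adds one column, so the lags restart with NA at the top of every StationID group.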
We can use shift from data.table, which can take multiple values for n. According to ?shift
n - Non-negative integer vector denoting the offset to lead or lag the input by. To create multiple lead/lag vectors, provide multiple values to n
Convert the 'data.frame' to a 'data.table' (setDT), order by 'StationID', 'Test', 'VisitDate' in i, group by 'StationID' and 'Test', get the lag of 'Result' (the default type of shift is "lag") with n as a vector of values, and assign (:=) the output to a vector of column names (created with paste0).
library(data.table)
i1 <- c(1, 5, 10, 15, 20)
setDT(Lake_Champlain_long.term_monitoring_1992_2016)[order(StationID,
Test, VisitDate), paste0("lag.Result", i1) := shift(Result, n = i1),
by = .(StationID, Test)][]
NOTE: This shows a more efficient solution.
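What shift returns for a vector n is easiest to see on a plain vector. A minimal sketch with made-up numbers:

```r
library(data.table)

# One lagged vector per value of n, returned as a list
lags <- shift(c(10, 20, 30, 40), n = c(1, 2))
print(lags)
```

Because the result is a list with one element per lag, it lines up with the vector of column names on the left-hand side of :=.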
This is a follow-up question. What I would like to check is whether any column in a data frame contains the same value (numerical or string) for all rows. For example,
sample <- data.frame(col1=c(1, 1, 1), col2=c("a", "a", "a"), col3=c(12, 15, 22))
The purpose is to inspect each column in a data frame to see which columns do not have identical entries in all rows. How can I do this, given that the columns hold both numbers and strings?
My expected output would be a vector containing the column number which has non-identical entries.
We can use apply column-wise (margin = 2) to count the unique values in each column and select the columns whose number of unique values is not equal to 1.
which(apply(sample, 2, function(x) length(unique(x))) != 1)
#col3
# 3
The same can also be done with an sapply or lapply call
which(sapply(sample, function(x) length(unique(x))) != 1)
#col3
# 3
A dplyr version could be
library(dplyr)
sample %>%
summarise_all(n_distinct) %>%  # funs() is deprecated in current dplyr
select_if(. != 1)
# col3
#1 3
We can use Filter
names(Filter(function(x) length(unique(x)) != 1, sample))
#[1] "col3"
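If the check is needed in several places, it could be wrapped in a small function. The name non_constant_cols is made up for this sketch, and it tests for more than one distinct value rather than "not equal to 1", which only differs for a data frame with zero rows.

```r
sample <- data.frame(col1 = c(1, 1, 1),
                     col2 = c("a", "a", "a"),
                     col3 = c(12, 15, 22))

# Hypothetical helper: positions of columns with more than one distinct value
non_constant_cols <- function(df) {
  varies <- vapply(df, function(x) length(unique(x)) > 1, logical(1))
  which(varies)
}

result <- non_constant_cols(sample)
print(result)
```

Working column by column with vapply avoids the conversion to a character matrix that apply performs on a data frame with mixed types.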