Sum Column in R Studio with Alphabetical Name

I cannot successfully sum a column in R Studio from a database in SQL. I keep getting the error "Error in FUN: only defined on a data frame with all numeric variables".
Currently, I have:
newObject <- dataFrame %>% sum("COLUMN NAME", na.rm = FALSE)

The problem is that you're trying to pipe the entire dataFrame object into the sum function.
In essence, you're trying this:
newObject <- sum(dataFrame, "COLUMN NAME", na.rm = FALSE)
That isn't working because some of the values in your dataFrame are character. And even if they aren't, "COLUMN NAME" at the very least is a character string.
You might be looking for summarise, but other possibilities may be transmute or mutate:
mtcars %>%
  summarise(Sum = sum(mpg, na.rm = FALSE))
#     Sum
# 1 642.9

mtcars %>%
  transmute(Sum = sum(mpg, na.rm = FALSE))
#     Sum
# 1 642.9
# 2 642.9
# ...

mtcars %>%
  mutate(Sum = sum(mpg, na.rm = FALSE))
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb   Sum
# 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 642.9
# 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 642.9
# ...
Here mpg is the name of a column in mtcars. You can replace that with your column name, but without quotes.
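If the column name only exists as a string (as in the original attempt), here is a minimal sketch using dplyr's .data pronoun, assuming the column is numeric; "COLUMN NAME" is the placeholder from the question:
library(dplyr)
# Sum a column referenced by a character string inside the pipe
newObject <- dataFrame %>%
  summarise(Sum = sum(.data[["COLUMN NAME"]], na.rm = FALSE))
# Base R equivalent, no pipe required:
newObject <- sum(dataFrame[["COLUMN NAME"]], na.rm = FALSE)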

Related

Dplyr: Conditionally rename multiple variables with regex by name

I need to rename multiple variables using a replacement dataframe. This replacement dataframe also includes regex. I would like to use a solution similar to the one proposed here, e.g.
df %>% rename_with(~ newnames, all_of(oldnames))
MWE:
df <- mtcars[, 1:5]
# works without regex
replace_df_1 <- tibble::tibble(
  old = df %>% colnames(),
  new = df %>% colnames() %>% toupper()
)
df %>% rename_with(~ replace_df_1$new, all_of(replace_df_1$old))
# with regex
replace_df_2 <- tibble::tibble(
  old = c("^m", "cyl101|cyl", "disp", "hp", "drat"),
  new = df %>% colnames() %>% toupper()
)
  old        new
  <chr>      <chr>
1 ^m         MPG
2 cyl101|cyl CYL
3 disp       DISP
4 hp         HP
5 drat       DRAT
# does not work
df %>% rename_with(~ replace_df_2$new, all_of(replace_df_2$old))
df %>% rename_with(~ matches(replace_df_2$new), all_of(replace_df_2$old))
EDIT 1:
The solution of @Mael works in general, but there seems to be an index issue; e.g., consider the following example:
replace_df_2 <- tibble::tibble(
  old = c("xxxx", "cyl101|cyl", "yyy", "xxx", "yyy"),
  new = mtcars[, 1:5] %>% colnames() %>% toupper()
)
mtcars[, 1:5] %>%
  rename_with(~ replace_df_2$new, matches(replace_df_2$old))
Results in
    mpg   MPG  disp    hp  drat
  <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9
meaning that the rename_with function correctly finds the column, but replaces it with the first item in the replacement column. How can we tell the function to take the respective row where a replacement has been found?
So in this example (edit 1), I only want to substitute the second column with "CYL", the rest should be left untouched. The problem is that the function takes the first replacement (MPG) instead of the second (CYL).
Thank you for any hints!
matches should be on the regex-y column:
df %>%
  rename_with(~ replace_df_2$new, matches(replace_df_2$old))
MPG CYL DISP HP DRAT
Mazda RX4 21.0 6 160.0 110 3.90
Mazda RX4 Wag 21.0 6 160.0 110 3.90
Datsun 710 22.8 4 108.0 93 3.85
Hornet 4 Drive 21.4 6 258.0 110 3.08
Hornet Sportabout 18.7 8 360.0 175 3.15
Valiant 18.1 6 225.0 105 2.76
#...
If the task is simply to set all col names to upper-case, then this works:
sub("^(.+)$", "\\U\\1", colnames(df), perl = TRUE)
[1] "MPG" "CYL" "DISP" "HP" "DRAT"
In dplyr:
df %>%
  rename_with(~ sub("^(.+)$", "\\U\\1", .x, perl = TRUE))
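If upper-casing is all that's needed, toupper can also be passed directly as the renaming function, which reads a little more simply:
df %>%
  rename_with(toupper)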
I found a solution using the idea of non-standard evaluation from this question and @Maël's answer.
Using map_lgl we create a logical vector that is TRUE if the column in replace_df_2$old can be found inside the dataframe df. We then index replace_df_2$new with this logical vector to get the correct replacement.
library(purrr)    # map_lgl
library(stringr)  # str_detect

df <- mtcars[, 1:5]
df %>%
  rename_with(
    .fn = ~ replace_df_2$new[map_lgl(replace_df_2$old, ~ any(str_detect(., names(df))))],
    .cols = matches(replace_df_2$old)
  )
Result:
mpg CYL disp hp drat
Mazda RX4 21.0 6 160.0 110 3.90

Order columns from a list of pre-defined names and ignore column names which don't exist in the list

I want to order a data.table by using a set of predefined names available in a list.
For example:
library(data.table)
dt <- as.data.table(mtcars)
list_name <-c("mpg", "disp", "xyz")
#Order columns
setcolorder(dt, list_name)
# requirement: if the "xyz" column doesn't exist, it should be ignored and the rest used
The use case is that there are multiple data.tables being created, all of which take their column names from a list of names. Some data may be missing columns, but the data needs to be ordered as per the list.
output:
dt
disp wt mpg cyl hp drat qsec vs am gear carb
1: 160.0 2.620 21.0 6 110 3.90 16.46 0 1 4 4
2: 160.0 2.875 21.0 6 110 3.90 17.02 0 1 4 4
3: 108.0 2.320 22.8 4 93 3.85 18.61 1 1 4 1
An option is to load all of them into a list, then loop over the list with lapply and apply setcolorder, using intersect on the names of each dataset to drop missing columns while ordering:
lst1 <- list(dt, dt)
lst1 <- lapply(lst1, function(x) setcolorder(x, intersect(list_name, names(x))))
If we need to reuse, create a function
f1 <- function(dat, nm1) {
  setcolorder(dat, intersect(nm1, names(dat)))
}
f1(dt, list_name)
f1(dt2, list_name)
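The intersect call is what makes the missing name safe to pass: it keeps the order of its first argument and silently drops anything the data does not have. A quick sketch with the list_name from the question:
intersect(c("mpg", "disp", "xyz"), names(dt))
# [1] "mpg"  "disp"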

Use pipes in R to set data

Is it possible to use the pipe operator in R not to get but to set data?
Let's say I want to modify the first row of the mtcars dataset and set the value of qsec to 99.
Traditional way:
mtcars[1, 7] <- 99
Is that also possible using the pipe Operator?
mtcars %>% filter(qsec == 16.46) %>% select(qsec) <- 99
If we are in a situation where the chain is absolutely necessary, or we are curious whether <- can be applied in a chain:
library(magrittr)
mtcars %>%
  `[<-`(1, 7, 99) %>%
  head(2)
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 99.00 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Also, inset (from the comments) is an alias for [<-
mtcars %>%
  inset(1, 7, 99) %>%
  head(2)
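A more dplyr-flavoured alternative, sketched here, is to do the assignment inside mutate; base replace() swaps in 99 at position 1 of qsec without leaving the pipe:
library(dplyr)
mtcars %>%
  mutate(qsec = replace(qsec, 1, 99)) %>%  # set row 1 of qsec to 99
  head(2)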

Using the dot operator in dplyr::bind_cols

I'm seeing some unexpected behavior with dplyr. I have a specific use case, but I will set up a dummy problem to illustrate my point. Why does this work,
library(dplyr)
temp <- bind_cols(mtcars %>% select(-mpg), mtcars %>% select(mpg))
head(temp)
cyl disp hp drat wt qsec vs am gear carb mpg
6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.0
6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.0
But not this,
library(dplyr)
temp <- mtcars %>% bind_cols(. %>% select(-mpg), . %>% select(mpg))
Error in cbind_all(x) : Argument 2 must be length 1, not 32
Thanks for the help.
Inside a nested call, magrittr treats . %>% select(-mpg) as a functional sequence (a function) rather than evaluating it on the data, which is why bind_cols fails. You need to wrap the call in {} so that each . refers to mtcars:
library(dplyr)
temp1 = mtcars %>% {bind_cols(select(., -mpg), select(., mpg))}
temp2 = bind_cols(mtcars %>% select(-mpg), mtcars %>% select(mpg))
# > identical(temp1, temp2)
# [1] TRUE
Another solution:
myfun <- function(x) {
  bind_cols(x %>% select(-mpg), x %>% select(mpg))
}
temp <- mtcars %>% myfun
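For completeness, a sketch of the same idea with base R's native |> pipe (R >= 4.2), which has no dot placeholder and so needs an anonymous function:
temp <- mtcars |>
  (\(x) bind_cols(select(x, -mpg), select(x, mpg)))()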

Selecting columns in R data frame based on those *not* in a vector

I'm familiar with being able to extract columns from an R data frame (or matrix) like so:
df.2 <- df[, c("name1", "name2", "name3")]
But can one use a ! or other tool to select all but those listed columns?
For background, I have a data frame with quite a few column vectors and I'd like to avoid:
Typing out the majority of the names when I could just remove a minority
Using the much shorter df.2 <- df[, c(1,3,5)] because when my .csv file changes, my code goes to heck since the numbering isn't the same anymore. I'm new to R and think I've learned the hard way not to use number vectors for larger df's that might change.
I tried:
df.2 <- df[, !c("name1", "name2", "name3")]
df.2 <- df[, !=c("name1", "name2", "name3")]
And just as I was typing this, found out that this works:
df.2 <- df[, !names(df) %in% c("name1", "name2", "name3")]
Is there a better way than this last one?
An alternative to grep is which:
df.2 <- df[, -which(names(df) %in% c("name1", "name2", "name3"))]
One caution: if none of the names match, which() returns integer(0), and df[, -integer(0)] drops every column, so guard against empty matches.
You can make a shorter call that is also more generalizable with negative grep:
df.2 <- df[, -grep("^name[1-3]$", names(df))]
Since grep returns integer positions, you can use negative indexing to remove columns. You could add further numbers or more complex patterns.
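grep can also hand back the surviving names directly; with value = TRUE and invert = TRUE it returns every name that does not match, a small sketch of the same idea:
df.2 <- df[, grep("^name[1-3]$", names(df), value = TRUE, invert = TRUE)]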
dplyr::select() has several options for dropping specific columns. any_of() ignores names that are absent (it supersedes the older one_of()):
library(dplyr)
drop_columns <- c('cyl', 'disp', 'hp')
mtcars %>%
  select(-any_of(drop_columns)) %>%
  head(2)
mpg drat wt qsec vs am gear carb
Mazda RX4 21 3.9 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21 3.9 2.875 17.02 0 1 4 4
Negating specific column names, the following drops the column "hp" and the columns from "qsec" through "gear":
mtcars %>%
  select(-hp, -(qsec:gear)) %>%
  head(2)
mpg cyl disp drat wt carb
Mazda RX4 21 6 160 3.9 2.620 4
Mazda RX4 Wag 21 6 160 3.9 2.875 4
You could also negate contains(), starts_with(), ends_with(), or matches():
mtcars %>%
  select(-contains('t')) %>%
  select(-starts_with('a')) %>%
  select(-ends_with('b')) %>%
  select(-matches('^m.+g$')) %>%
  head(2)
cyl disp hp qsec vs gear
Mazda RX4 6 160 110 16.46 0 4
Mazda RX4 Wag 6 160 110 17.02 0 4
Old thread, but here's another solution:
df.2 <- subset(df, select = -c(name1, name2, name3))
This was posted in another similar thread (though I can't find it right now). Should be sustainable code in the situation you describe, and is probably easier to read and edit than some of the other options.
You could make a custom function to do this if you're using it for your own use to manipulate data. I may do something like this:
rm.col <- function(df, ...) {
  # capture the unquoted column names passed via ...
  x <- substitute(...())
  # convert them to a trimmed character vector (trimws is base R)
  z <- trimws(unlist(lapply(x, as.character)))
  df[, !names(df) %in% z]
}
rm.col(mtcars, hp, mpg)
The first argument is the dataframe name. The following ... are the names of any columns you wish to remove.
The easiest way that comes to my mind:
filtered_df <- df[, setdiff(names(df), c("name1", "name2"))]
Essentially, you are computing the set difference between the full list of column names and the subset you want to filter out (name1 and name2 above).
