I'm looking for a better, more efficient way to subtract a vector from each row of a data frame (df1). My current solution repeats the vector (Vec) to create a data frame (Vec_df1) with the same number of rows as df1 and then subtracts the two data frames. Is there a more "direct" way to do this without creating the new Vec_df1 data frame (preferably in the tidyverse)? See the example data below.
#Example data
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)
df1 <- tibble(V1, V2, V3)
Vec <- c(1, 1, 2)
# Current solution, creates a dataframe with the same nrows by repeating the vector.
Vec_df1 <- tibble::as_tibble(t(Vec)) %>%
dplyr::slice(rep(dplyr::row_number(), nrow(df1)))
# Subtraction.
df2 <- df1-Vec_df1
df2
Thanks in advance
We can use sweep :
sweep(df1, 2, Vec, `-`)
# `-` is default FUN in sweep so you can also use
#sweep(df1, 2, Vec)
# V1 V2 V3
#1 0 3 5
#2 1 4 6
#3 2 5 7
Or an attempt similar to yours
df1 - rep(Vec, each = nrow(df1))
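Both forms work because R walks a data frame column by column when it recycles a vector. The sketch below is only an illustration of that: it uses a base data.frame in place of the question's tibble (an assumption, to avoid loading any package) and compares the two results, then shows what goes wrong with a plain `df1 - Vec`.

```r
# Base data.frame stand-in for the question's tibble (assumption: no tidyverse loaded)
df1 <- data.frame(V1 = c(1, 2, 3), V2 = c(4, 5, 6), V3 = c(7, 8, 9))
Vec <- c(1, 1, 2)

# rep(..., each = n) repeats every element n times, so the values line up
# with the columns of df1 when R goes through the data column by column
expanded <- rep(Vec, each = nrow(df1))
expanded

by_rep   <- df1 - expanded
by_sweep <- sweep(df1, 2, Vec, `-`)

# Both approaches should hold the same values
all.equal(unname(as.matrix(by_rep)), unname(as.matrix(by_sweep)))

# Pitfall: plain df1 - Vec recycles Vec down each column instead,
# i.e. it subtracts Vec[i] from row i rather than Vec[j] from column j
naive <- df1 - Vec
naive
```

The naive version only runs without complaint here because Vec happens to have as many elements as df1 has rows.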
A similar approach using map2_df():
library(purrr)
map2_df(df1, Vec, `-`)
# A tibble: 3 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 0 3 5
2 1 4 6
3 2 5 7
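A matrix-based approach, which is often fast because the subtraction happens on a plain matrix (benchmark on your own data to confirm):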
as_tibble(t(t(df1) - Vec))
# A tibble: 3 x 3
V1 V2 V3
<dbl> <dbl> <dbl>
1 0 3 5
2 1 4 6
3 2 5 7
We can also do
df1 - Vec[col(df1)]
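To see why this works, it may help to look at what col() returns. This is a small sketch on a base data.frame (again an assumption, the question uses a tibble); the names idx and per_cell are just for illustration.

```r
df1 <- data.frame(V1 = c(1, 2, 3), V2 = c(4, 5, 6), V3 = c(7, 8, 9))
Vec <- c(1, 1, 2)

# Each cell of idx holds the index of the column it sits in
idx <- col(df1)
idx

# Indexing Vec with idx yields one value per cell, in column order
per_cell <- Vec[idx]
per_cell

result <- df1 - per_cell
result
```

So Vec[col(df1)] builds the same expanded vector as rep(Vec, each = nrow(df1)), just derived from the shape of df1.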
I am working with an SPSS file that has been exported as tab delimited. In SPSS, you can set values to represent different types of missing and the dataset has 98 and 99 to indicate missing.
I want to convert them to NA but only in certain columns (V2 and V3 in the example data, leaving V1 and V4 unchanged).
library(dplyr)
testdf <- data.frame(V1 = c(1, 2, 3, 4),
V2 = c(1, 98, 99, 2),
V3 = c(1, 99, 2, 3),
V4 = c(98, 99, 1, 2))
outdf <- testdf %>%
mutate(across(V2:V3), . = ifelse(. %in% c(98,99), NA, .))
I haven't used across before and cannot work out how to have the mutate return the ifelse into the same columns. I suspect I am overthinking this, but can't find any similar examples that have both across and ifelse. I need a tidyverse answer, prefer dplyr or tidyr.
You need slightly different syntax to make it work. Check ?across for more info.
You need to use a ~ to make a valid function (or use \(.) or function(.)).
You need to pass that function inside the across() call.
library(dplyr)
testdf %>%
mutate(across(V2:V3, ~ ifelse(. %in% c(98,99), NA, .)))
# V1 V2 V3 V4
# 1 1 1 1 98
# 2 2 NA NA 99
# 3 3 NA 2 1
# 4 4 2 3 2
Note that an alternative is replace:
testdf %>%
mutate(across(V2:V3, ~ replace(., . %in% c(98,99), NA)))
Base R option using lapply with an ifelse like this:
cols <- c("V2","V3")
testdf[,cols] <- lapply(testdf[,cols],function(x) ifelse(x %in% c(98,99),NA,x))
testdf
#> V1 V2 V3 V4
#> 1 1 1 1 98
#> 2 2 NA NA 99
#> 3 3 NA 2 1
#> 4 4 2 3 2
Created on 2022-10-19 with reprex v2.0.2
Base R (this sets every value above 97 to NA, so it assumes 98 and 99 are the only values that large):
cols <- c("V2", "V3")
testdf[, cols ][ testdf[, cols ] > 97 ] <- NA
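The threshold shortcut and the exact-match versions only agree when no legitimate value exceeds 97. The sketch below uses made-up data with a genuine value of 150 to show the difference; the names missing_codes, by_threshold and by_match are hypothetical.

```r
# Hypothetical data: V2 contains a legitimate 150 next to the codes 98/99
testdf <- data.frame(V1 = c(1, 2, 3, 4),
                     V2 = c(1, 98, 150, 2),
                     V3 = c(1, 99, 2, 3))
cols <- c("V2", "V3")
missing_codes <- c(98, 99)  # sentinel values, as described in the question

# Threshold version: also wipes out the legitimate 150
by_threshold <- testdf
by_threshold[, cols][by_threshold[, cols] > 97] <- NA

# Exact-match version: only the listed codes become NA
by_match <- testdf
by_match[cols] <- lapply(by_match[cols],
                         function(x) replace(x, x %in% missing_codes, NA))

by_threshold$V2
by_match$V2
```

If the data can contain real values above the missing codes, the %in% form is the safer choice.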
How do I add a column to a data frame consisting of the minimum values from other columns? So in this case, to create a third column that will have the values 1, 2 and 2?
df = data.frame(A = 1:3, B = 4:2)
You can use apply() function to do this. See below.
df$C <- apply(df, 1, min)
The second argument (MARGIN) selects the dimension over which min is applied. Here it is 1, so min is applied across all columns of each row separately.
You can choose specific columns from the dataframe, as follows:
df$newCol <- apply(df[c('A','B')], 1, min)
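As a quick illustration of the MARGIN argument, here is a sketch on the question's data; row_min and col_min are names chosen for this example only.

```r
df <- data.frame(A = 1:3, B = 4:2)

# MARGIN = 1: one minimum per row
row_min <- apply(df, 1, min)
row_min

# MARGIN = 2: one minimum per column
col_min <- apply(df, 2, min)
col_min
```

Note that apply() converts the data frame to a matrix first, so this is best used when all selected columns are numeric.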
You can call the parallel minimum function with do.call to apply it on all your columns:
df$C <- do.call(pmin, df)
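do.call() passes every column of df as a separate argument to pmin, so the call is equivalent to writing the columns out by hand. A minimal sketch, with a hypothetical second data frame (df_na) added to show how missing values could be ignored:

```r
df <- data.frame(A = 1:3, B = 4:2)

# Equivalent calls: columns passed via do.call vs. written out explicitly
via_do_call <- do.call(pmin, df)
direct      <- pmin(df$A, df$B)
all(via_do_call == direct)

# Hypothetical extension: extra arguments can be appended to the list,
# here na.rm = TRUE so that NA values are skipped
df_na <- data.frame(A = c(1, NA, 3), B = c(4, 3, NA))
min_na <- do.call(pmin, c(df_na, na.rm = TRUE))
min_na
```

This scales to any number of numeric columns without naming them.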
Using dplyr's rowwise():
df %>%
rowwise() %>%
mutate(C = min(A, B))
# A tibble: 3 × 3
# Rowwise:
A B C
<int> <int> <int>
1 1 4 1
2 2 3 2
3 3 2 2
Using input with equal values across rows:
df = data.frame(A = 1:10, B = 11:2)
df %>%
rowwise() %>%
mutate(C = min(A, B))
# A tibble: 10 × 3
# Rowwise:
A B C
<int> <int> <int>
1 1 11 1
2 2 10 2
3 3 9 3
4 4 8 4
5 5 7 5
6 6 6 6
7 7 5 5
8 8 4 4
9 9 3 3
10 10 2 2
You can simply do:
df$C <- apply(FUN=min,MARGIN=1,X=df)
Or:
df[, "C"] <- apply(FUN=min,MARGIN=1,X=df)
or:
df["C"] <- apply(FUN=min,MARGIN=1,X=df)
Instead of apply, you could also use sapply on data.frame(t(df)), where t transposes df. sapply traverses a data frame column-wise, applying the given function to each column, so the rows must first be turned into columns. Since t always returns a matrix, you need to convert the result back with data.frame().
df$C <- sapply(data.frame(t(df)), min)
Or one could use the fact that ifelse is vectorized:
df$C <- with(df, ifelse(A<B,A,B))
Or:
df$C <- ifelse(df$A < df$B, df$A, df$B)
matrixStats
# install.packages("matrixStats")
matrixStats::rowMins(as.matrix(df))
According to this SO answer, this is the fastest option.
apply-type functions call an R function once per row or element, which tends to make them slow on large data.
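Speed claims like this are easy to check on your own machine. The following is only a rough sketch: it uses system.time() on simulated data (the data frame big is made up) and compares apply() with the vectorised pmin(); matrixStats is left out so that nothing needs to be installed. Timings will vary, so none are shown.

```r
set.seed(1)
# Simulated data, purely for timing purposes
big <- data.frame(A = runif(1e5), B = runif(1e5))

# Row-wise minimum via apply(): one R-level call to min() per row
t_apply <- system.time(m_apply <- apply(big, 1, min))

# Row-wise minimum via pmin(): a single vectorised call
t_pmin <- system.time(m_pmin <- do.call(pmin, big))

# Compare elapsed times and confirm both give the same values
t_apply["elapsed"]
t_pmin["elapsed"]
all.equal(unname(m_apply), m_pmin)
```

For more reliable numbers, a dedicated benchmarking package with repeated runs would be a better tool than a single system.time() call.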
You can use transform() to add the min column as the output of pmin(A, B), referring to the columns of df by name without df$ indexing:
df <- transform(df, min = pmin(A, B))
Or, in data.table:
library(data.table)
DT = data.table(a = 1:3, b = 4:2)
DT[, min := pmin(a, b)]
I don't think this exact question has been asked yet (for R, anyway).
I want to retain any columns in my dataset (there are hundreds in actuality) that contain a certain string, and drop the rest. I have found plenty of examples of string searching column names, but nothing for the contents of the columns themselves.
As an example, say I have this dataset:
df = data.frame(v1 = c(1, 8, 7, 'No number'),
v2 = c(5, 3, 5, 1),
v3 = c('Nothing', 4, 2, 9),
v4 = c(3, 8, 'Something', 6))
For this example, say I want to retain any columns with the string No, so that the resulting dataset is:
v1 v3
1 1 Nothing
2 8 4
3 7 2
4 No number 9
How can I do this in R? I am happy with any sort of solution (e.g., base R, dplyr, etc.)!
Thanks in advance!
Simply
df[grep("No", df)]
# v1 v3
# 1 1 Nothing
# 2 8 4
# 3 7 2
# 4 No number 9
This works because grep internally checks if (!is.character(x)) and, if that's true, it basically does:
s <- structure(as.character(df), names = names(df))
s
# v1
# "c(\"1\", \"8\", \"7\", \"No number\")"
# v2
# "c(5, 3, 5, 1)"
# v3
# "c(\"Nothing\", \"4\", \"2\", \"9\")"
# v4
# "c(\"3\", \"8\", \"Something\", \"6\")"
grep("No", s)
# [1] 1 3
Note:
R.version.string
# [1] "R version 4.0.3 (2020-10-10)"
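Because grep() is then matched against the deparsed text of each column rather than the individual cells, some patterns can match more than intended. The sketch below contrasts that with an explicit column-wise check; has_no is a name chosen for this example, and fixed = TRUE is an assumption that the search string is a literal rather than a regular expression.

```r
df <- data.frame(v1 = c(1, 8, 7, "No number"),
                 v2 = c(5, 3, 5, 1),
                 v3 = c("Nothing", 4, 2, 9),
                 v4 = c(3, 8, "Something", 6))

# Column-wise check: TRUE for columns where any cell contains "No"
has_no <- vapply(df, function(col) any(grepl("No", col, fixed = TRUE)),
                 logical(1))
has_no
df[has_no]

# Caveat of grep() on the whole data frame: the pattern is compared with
# text such as "c(5, 3, 5, 1)", so a pattern like "c" can match columns
# whose cells never contain that letter
grep("c", df)
```

For patterns that cannot collide with the deparsed syntax, both approaches should select the same columns.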
Base R :
df[colSums(sapply(df, grepl, pattern = 'No')) > 0]
# v1 v3
#1 1 Nothing
#2 8 4
#3 7 2
#4 No number 9
Using dplyr :
library(dplyr)
df %>% select(where(~any(grepl('No', .))))
Use the dplyr::select_if() function (superseded by select(where(...)) since dplyr 1.0.0, but still available):
df <- df %>% select_if(function(col) any(grepl("No", col)))
You can run grepl for each column and if there's any value in there, pick it.
df = data.frame(v1 = c(1, 8, 7, 'No number'),
v2 = c(5, 3, 5, 1),
v3 = c('Nothing', 4, 2, 9),
v4 = c(3, 8, 'Something', 6))
find.no <- sapply(X = df, FUN = function(x) {
  # grepl returns one TRUE/FALSE per cell; any() collapses them per column
  any(grepl("No", x = x))
})
> df[, find.no]
v1 v3
1 1 Nothing
2 8 4
3 7 2
4 No number 9
Currently, I have a data frame built in R that looks like this:
df <- data.frame(c('ABC','DEF','HIJ'),
c(1,2,5),
c(2,5,9),
c(14,19,12))
And I have a function which searches for one value across the entire data frame and returns the entire row, the function for this is below:
df[which(df == 5,
arr.ind = TRUE)[,"row"],]
This function returns the following when executed:
HIJ 5 9 12
DEF 2 5 19
I would like to supply a vector of values and search for all of them in one go, returning every row that contains a match. However, I have been unable to wrap my search above in a loop that checks each value of a vector against the dataset. Below is an example of what I am trying to achieve: searching for the values in vector v across data frame df and returning all rows of df in which any column holds a value that also appears in v:
v <- c(1,2,13,19,16,120,2934,1087)
Searching this across the data frame, I would like to return:
ABC 1 2 14
DEF 2 5 19
I am wondering what would be the best way to perform a loop to do this search?
We can use :
df[rowSums(sapply(df, `%in%`, v)) > 0, ]
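To make the one-liner easier to follow, here it is split into steps. The column names V1 to V4 are an assumption (the question's data frame has no explicit names), and hits and keep are names used only in this sketch.

```r
df <- data.frame(V1 = c("ABC", "DEF", "HIJ"),
                 V2 = c(1, 2, 5),
                 V3 = c(2, 5, 9),
                 V4 = c(14, 19, 12))
v <- c(1, 2, 13, 19, 16, 120, 2934, 1087)

# One logical column per column of df: TRUE where the cell value occurs in v
hits <- sapply(df, `%in%`, v)
hits

# A row is kept if at least one of its cells matched
keep <- rowSums(hits) > 0
keep

df[keep, ]
```

No explicit loop is needed because %in% already checks every element of a column against all values of v.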
Or using dplyr :
library(dplyr)
df %>% filter_all(any_vars(. %in% v))
It may be easier to reshape your data first. I'll use data.table::melt:
library(data.table)
df = data.frame(
V1 = c("ABC", "DEF", "HIJ"),
V2 = c(1, 2, 5),
V3 = c(2, 5, 9),
V4 = c(14, 19, 12)
)
setDT(df)
# reshape long
melt_df = melt(df, id.vars = 'V1')
melt_df
# V1 variable value
# 1: ABC V2 1
# 2: DEF V2 2
# 3: HIJ V2 5
# 4: ABC V3 2
# 5: DEF V3 5
# 6: HIJ V3 9
# 7: ABC V4 14
# 8: DEF V4 19
# 9: HIJ V4 12
Now we can look it all up at once:
melt_df[value %in% v]
# V1 variable value
# 1: ABC V2 1
# 2: DEF V2 2
# 3: ABC V3 2
# 4: DEF V4 19
That's the gist of it. To get back your original desired output, we need to do some other steps:
df[.(V1 = melt_df[value %in% v, unique(V1)]), on = 'V1']
# V1 V2 V3 V4
# 1: ABC 1 2 14
# 2: DEF 2 5 19
This pulls the associated values of V1 from melt_df (unique removes duplicates) and joins them back to df (hence on = 'V1') to get the matching rows from df.
What is the best way to convert a specific column in each list object to a specific format?
For instance, I have a list with four objects (each of which is a data frame) and I want to change column 3 in each data.frame from double to integer?
I'm guessing something along the lines of lapply, but I didn't know what specific syntax to use. I was trying:
lapply(df,function(x){as.numeric(var1(x))})
but it wasn't working.
Thanks!
Yes, lapply works well here:
lapply(listofdfs, function(df) { # loop through each data.frame in list
df[ , 3] <- as.integer(df[ , 3]) # make the 3rd column of type integer
df # return the new data.frame
})
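Here is a self-contained variant that can be run as is. The list contents are made up for illustration (one_df, listofdfs and converted are hypothetical names). It uses df[[3]] instead of df[ , 3], because [[ ]] extracts a plain vector from both data frames and tibbles, whereas df[ , 3] on a tibble returns a one-column tibble that as.integer() cannot convert.

```r
# Hypothetical list of three identical data frames
one_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), d = c(7, 8, 9))
listofdfs <- list(one_df, one_df, one_df)

converted <- lapply(listofdfs, function(df) {
  df[[3]] <- as.integer(df[[3]])  # convert the 3rd column to integer
  df                              # return the modified data frame
})

# Type of the 3rd column in each list element, before and after
vapply(listofdfs, function(df) typeof(df[[3]]), character(1))
vapply(converted, function(df) typeof(df[[3]]), character(1))
```

lapply() returns a new list, so the result has to be assigned (here to converted) for the change to persist.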
This is just an alternative to C. Braun's answer.
You can also use the map() function from the purrr package.
Input:
library(tidyverse)
df <- tibble(a = c(1, 2, 3), b =c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)
myList
Method:
# funs() is deprecated in current dplyr; across() does the same job
map(myList, ~ mutate(.x, across(3, as.integer)))
Output:
[[1]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[2]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[3]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
You can use this:
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
Simple example:
data <- data.frame(cbind(c("1","2","3","4",NA),c(1:5)),stringsAsFactors = F)
typeof(data[,1]) #character
dlist <- list(data,data,data)
coltochange <- 1
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
typeof(dlist[[1]][,1]) #character
typeof(dlist2[[1]][,1]) #double