My data frame looks like this:
df <- read.table(text="
column1 column2 column3
1 3 2 1
2 3 2 1
", header=TRUE)
I need to subtract the last two columns from the first. To sum those columns I'd use rowSums(summary[,1:3]), but I don't know how to subtract them. Note that I can't just write my code like this, since I don't know the column names:
`result <- df %>%
mutate(result = rowSums(column1, - column2, - column3))`
We can subset the data to remove the first column (.[-1]), get the rowSums and subtract from 'column1'
library(tidyverse)
df %>%
mutate(result = column1 - rowSums(.[-1]))
# column1 column2 column3 result
#1 3 2 1 0
#2 3 2 1 0
If there are more columns and we want to select only the last two:
df %>%
mutate(result = column1 - rowSums(.[tail(names(.), 2)]))
If we only have the indices of the columns involved in the operation:
df %>%
mutate(result = .[[1]] - rowSums(.[c(2, 3)]))
data
df <- structure(list(column1 = c(3L, 3L), column2 = c(2L, 2L), column3 = c(1L,
1L)), class = "data.frame", row.names = c(NA, -2L))
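When the column names are unknown, the same subtraction can also be written purely by position in base R. This is a minimal sketch, assuming the first column is the one to subtract from and the last two columns are the ones to subtract; the helper variable n is introduced here only for illustration.

```r
# Same data as above
df <- data.frame(column1 = c(3L, 3L), column2 = c(2L, 2L), column3 = c(1L, 1L))

# Record the column count before adding 'result', so the new column
# is not mistaken for one of the "last two" columns
n <- ncol(df)

# First column minus the row-wise sum of the last two columns
df$result <- df[[1]] - rowSums(df[(n - 1):n])
df
```

Storing n first matters if the code is run twice: once 'result' exists, ncol(df) would point at the wrong columns.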
I have 5 columns with numerical data and I would like to filter for rows that match a data range in at least 3 of the 5 columns.
For example, I have the following data frame and I define a value range of 5-10.
My first row has 3 columns with values between 5 and 10, so I want to keep that row.
The second row only has 2 values between 5 and 10, so I want to remove it.
column1 column2 column3 column4 column5
7 4 10 9 2
4 8 2 6 2
First test whether each value is greater than or equal to 5 and less than or equal to 10, then keep the rows where 3 or more values meet the condition.
dat[ rowSums( dat >= 5 & dat <= 10 ) >= 3, ]
column1 column2 column3 column4 column5
1 7 4 10 9 2
Data
dat <- structure(list(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L,
2L), column4 = c(9L, 6L), column5 = c(2, 2)), class = "data.frame", row.names = c(NA,
-2L))
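To make the one-liner easier to follow, it can be split into its three steps. This is only an illustrative breakdown of the same expression; the intermediate names in_range, hits, and kept are not part of the original answer.

```r
dat <- data.frame(column1 = c(7L, 4L), column2 = c(4L, 8L), column3 = c(10L, 2L),
                  column4 = c(9L, 6L), column5 = c(2, 2))

# Step 1: logical matrix, TRUE where a value lies in [5, 10]
in_range <- dat >= 5 & dat <= 10

# Step 2: number of in-range values per row (TRUE counts as 1)
hits <- rowSums(in_range)

# Step 3: keep the rows with at least 3 in-range values
kept <- dat[hits >= 3, ]
kept
```

Here hits is 3 for the first row and 2 for the second, so only the first row is kept.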
I'd like to share a second approach:
# Setting up data
my_df <- tibble::tibble(A = c(7,4), B = c(4,8), C = c(10, 2), D = c(9,6), E = c(2,2), X = c("some", "character"))
my_min <- 5
my_max <- 10
Then do some tidyverse-magic:
# This is verbose, but shows clearly all the steps involved:
my_df_filtered <- my_df %>%
  # one logical column per numeric column: is the value within [my_min, my_max]?
  dplyr::mutate(n_cols_in_range = dplyr::across(where(is.numeric), ~ .x >= my_min & .x <= my_max)) %>%
  dplyr::rowwise() %>%
  # count the TRUE values in each row
  dplyr::mutate(n_cols_in_range = sum(n_cols_in_range, na.rm = TRUE)) %>%
  dplyr::filter(n_cols_in_range >= 3) %>%
  dplyr::select(-n_cols_in_range)
The above is equivalent to:
my_df_filtered <- my_df %>%
dplyr::rowwise() %>%
dplyr::filter(sum(dplyr::across(where(is.numeric), ~ .x >= my_min & .x <= my_max), na.rm = TRUE) >= 3)
But I must say that the answer above is clearly more elegant, since it needs only one line of code!
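The one-line rowSums approach can also cope with a non-numeric column such as X, provided the comparison is limited to the numeric columns first. The following is a rough base R sketch under that assumption; it uses a plain data.frame instead of a tibble, and is_num and num are illustrative names.

```r
my_df <- data.frame(A = c(7, 4), B = c(4, 8), C = c(10, 2), D = c(9, 6), E = c(2, 2),
                    X = c("some", "character"))
my_min <- 5
my_max <- 10

# Restrict the comparison to numeric columns; the character column X is left out
is_num <- vapply(my_df, is.numeric, logical(1))
num <- my_df[is_num]

# Count in-range values per row and keep rows with at least 3 of them
my_df_filtered <- my_df[rowSums(num >= my_min & num <= my_max, na.rm = TRUE) >= 3, ]
my_df_filtered
```

The filtered result still contains column X, because only the row selection is based on the numeric subset.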
I have a data.frame (df1) and I want to include a single, most recent age for each of my samples from another data.frame (df2):
df1$age <- df2$age_9[match(df1$Sample_ID, df2$Sample_ID)]
The problem is that df2 has 9 columns for age, each one giving the age at a specific check-up date (age_1 is from the first visit, age_9 is the age at the 9th visit), and patients don't make all their visits.
How do I add the most recently obtained age from a non-empty check-up date?
That is, if age_9 == "." replace "." with age_8, then if age_8 == "." replace "." with age_7, and so on.
From this:
View(df1)
Sample Age
1 50
2 .
3 .
To:
View(df1)
Sample Age
1 50
2 49
3 30
From the data df2
View(df2)
Sample Age_1 Age_2 Age_3
1 40 42 44
2 35 49 .
3 30 . .
This is my attempt:
df1$age[which(df1$age == ".")] <- df2$age_8[match(df1$Sample_ID, df2$Sample_ID)]
With base R, we can use max.col to return, for each row, the index of the last 'Age' column that is not ".". We then cbind it with the sequence of rows to build a row/column index, extract those elements, and assign them to the 'Age' column of 'df1' wherever 'Age' is ".". Note that this assumes the rows of 'df1' and 'df2' are in the same order.
df1$Age <- ifelse(df1$Age == ".", df2[-1][cbind(seq_len(nrow(df2)),
max.col(df2[-1] != ".", "last"))], df1$Age)
df1 <- type.convert(df1, as.is = TRUE)
-output
df1
# Sample Age
#1 1 50
#2 2 49
#3 3 30
Or using tidyverse: reshape into 'long' format, slice the last row grouped by 'Sample', and then join with 'df1'
library(dplyr)
library(tidyr)
df2 %>%
mutate(across(starts_with('Age'), as.integer)) %>%
pivot_longer(cols = starts_with('Age'), values_drop_na = TRUE) %>%
group_by(Sample) %>%
slice_tail(n = 1) %>%
ungroup %>%
select(-name) %>%
right_join(df1) %>%
transmute(Sample, Age = coalesce(as.integer(Age), value))
-output
# A tibble: 3 x 2
# Sample Age
# <int> <int>
#1 1 50
#2 2 49
#3 3 30
data
df1 <- structure(list(Sample = 1:3, Age = c("50", ".", ".")),
class = "data.frame",
row.names = c(NA,
-3L))
df2 <- structure(list(Sample = 1:3, Age_1 = c(40L, 35L, 30L), Age_2 = c("42",
"49", "."), Age_3 = c("44", ".", ".")), class = "data.frame",
row.names = c(NA,
-3L))
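Another way to express the "take age_9, else age_8, else age_7 ..." rule is to walk the age columns from latest to earliest and fill gaps as you go. The following is a base R sketch of that idea, not the answer's method; it assumes "." is the only missing-value marker, and the names ages, latest, and age1 are illustrative. It matches on Sample, so the two data frames do not need to be in the same row order.

```r
df1 <- data.frame(Sample = 1:3, Age = c("50", ".", "."))
df2 <- data.frame(Sample = 1:3, Age_1 = c(40L, 35L, 30L),
                  Age_2 = c("42", "49", "."), Age_3 = c("44", ".", "."))

# Convert the age columns to integer; "." becomes NA (coercion warnings suppressed)
ages <- lapply(df2[-1], function(x) suppressWarnings(as.integer(x)))

# Walk from the latest visit to the earliest, filling NAs with older values
latest <- Reduce(function(newer, older) ifelse(is.na(newer), older, newer), rev(ages))

# Match by Sample, then fill only the missing ages in df1
age1 <- suppressWarnings(as.integer(df1$Age))
df1$Age <- ifelse(is.na(age1), latest[match(df1$Sample, df2$Sample)], age1)
df1
```

Sample 1 keeps its existing age of 50, because only the "." entries in df1 are replaced.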
I'm looking for an efficient way to rename several columns.
I have a dataframe that looks like the following.
id sdf dir fki
1 3 4 2
2 5 2 1
3 4 1 2
I want to rename columns sdf, dir, and fki.
I know I could do so like this:
df <- df %>%
rename(newname1 = sdf,
newname2 = dir,
newname3 = fki)
With the number of columns I have, it takes a long time to type the names of all the columns I would like to replace.
Ideally, I would like to create a vector with names:
newcolumns <- c("newname1", "newname2", "newname3")
And then specify that these should replace the column names in the dataframe, starting with column sdf. Is there a way to do this?
We can use rename_at (superseded by rename_with since dplyr 1.0.0, but still available)
library(dplyr)
df %>%
rename_at(vars(-id), ~ newcolumns)
-output
# id newname1 newname2 newname3
#1 1 3 4 2
#2 2 5 2 1
#3 3 4 1 2
Or with rename_with
df %>%
rename_with(~ newcolumns, -id)
Or pass a named vector and use !!! in rename
df %>%
rename(!!! setNames(names(df)[-1], newcolumns))
Or using base R
names(df)[-1] <- newcolumns
data
df <- structure(list(id = 1:3, sdf = c(3L, 5L, 4L), dir = c(4L, 2L,
1L), fki = c(2L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
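The question asks for the renaming to start at column sdf rather than "everything except id". A small base R sketch of that variant follows, assuming the columns to rename are contiguous; start and idx are illustrative names.

```r
df <- data.frame(id = 1:3, sdf = c(3L, 5L, 4L), dir = c(4L, 2L, 1L), fki = c(2L, 1L, 2L))
newcolumns <- c("newname1", "newname2", "newname3")

# Position of the first column to rename
start <- match("sdf", names(df))

# Positions of as many consecutive columns as there are new names
idx <- seq(start, length.out = length(newcolumns))

# Guard against running past the last column
stopifnot(!is.na(start), max(idx) <= ncol(df))

names(df)[idx] <- newcolumns
names(df)
```

This leaves any columns before sdf, and any after the renamed block, untouched.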
I'm currently using group_by then slice, to get the maximum dates in my data. There are a few rows where the date is NA, and when using slice(which.max(END_DT)), the NAs end up getting dropped. Is there an equivalent of summarise_all, so that I can keep the NAs in my data?
ID Date INitials
1 01-01-2020 AZ
1 02-01-2020 BE
2 NA CC
I'm using
df %>%
group_by(ID) %>%
slice(which.max(Date))
I need the final results to look like below, but it's dropping the NA entirely
ID Date Initials
1 02-01-2020 BE
2 NA CC
which.max() is not suitable in this case because (1) it drops missing values and (2) it only finds the first position of maxima. Here is a general solution:
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%m-%d-%Y")) %>%
group_by(ID) %>%
filter(Date == max(Date) | all(is.na(Date)))
# # A tibble: 2 x 3
# # Groups: ID [2]
# ID Date INitials
# <int> <date> <fct>
# 1 1 2020-02-01 BE
# 2 2 NA CC
df <- structure(list(ID = c(1L, 1L, 2L), Date = structure(c(1L, 2L,
NA), .Label = c("01-01-2020", "02-01-2020"), class = "factor"),
INitials = structure(1:3, .Label = c("AZ", "BE", "CC"), class = "factor")),
class = "data.frame", row.names = c(NA, -3L))
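One case the example data does not contain is a group that mixes real dates with NA. There max(Date) returns NA, so a filter like the one above would drop every row of that group. The sketch below is one possible variant for that situation, not part of the original answer; the rows for ID 3 are hypothetical data added only to show the case.

```r
library(dplyr)

# Hypothetical data: ID 3 mixes a real date with an NA (not in the original question)
df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 3L),
  Date = as.Date(c("2020-01-01", "2020-02-01", NA, "2020-03-01", NA)),
  INitials = c("AZ", "BE", "CC", "DD", "EE")
)

res <- df %>%
  group_by(ID) %>%
  # all-NA group: keep every row; otherwise keep rows equal to the non-missing maximum
  filter(if (all(is.na(Date))) TRUE else !is.na(Date) & Date == max(Date, na.rm = TRUE)) %>%
  ungroup()
res
```

The if/else is evaluated once per group, so max(Date, na.rm = TRUE) is never called on a group that has no dates at all.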
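It's dropping the NA because you're asking it to find the max date, and NA is never the maximum. If you want to go the "which.max" route, I'd run the original dataset again through filter to grab the NA rows, then bind them to the sliced result.
# rows with the latest date per ID (all-NA groups are dropped here)
df.max <- df %>% group_by(ID) %>% slice(which.max(Date)) %>% ungroup()
df.1 <- df %>%
filter(is.na(Date))
df <- rbind(df.max, df.1)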
I am trying to iterate through columns; if a column is a whole year, it should be duplicated four times and renamed to quarters.
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to find column names with 4 digits, then take the last two digits of those names to build the quarter names
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names:
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
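The re-indexing answer above produces names of the form 00.Q1, while the question asked for Q1-00. A compact sketch of the same technique with that naming follows; it assumes the data frame is built with check.names = FALSE so that the hyphenated names survive, and is_year, reps, new_names, and out are illustrative names.

```r
# check.names = FALSE keeps names such as "2000" and "Q1-01" unchanged
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# Columns whose name is exactly four digits are treated as whole years
is_year <- grepl("^[0-9]{4}$", names(df))

# Year columns are repeated 4 times, all others once
reps <- ifelse(is_year, 4L, 1L)

# Year names expand to Q1-yy ... Q4-yy; other names are kept as they are
new_names <- unlist(lapply(names(df), function(nm) {
  if (grepl("^[0-9]{4}$", nm)) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
}))

out <- df[rep(seq_along(df), reps)]
names(out) <- new_names
out
```

Because the columns are selected by repeated index, this works for any number of rows, not only the single-row example.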