I have the following two data frames:
letters <- LETTERS[seq(from = 1, to = 5)]
values <- rnorm(5, mean = 50)
df1 <- data.frame(letters, values)
category <- sample(LETTERS[1:5], 20, replace = TRUE)
numbers <- rnorm(20, mean = 100)
df2 <- data.frame(category, numbers)
I want to create a new column in df2 that takes the value in df2$numbers and subtracts the value in df1$values based on the matching letter.
In other words, if the value for "C" in df1 is 49.2, I want to subtract 49.2 from every row in df2$numbers where df2$category equals "C". Hope that makes sense. Thanks for the help!
With dplyr:
library(dplyr)
# left_join keeps exactly the rows of df2 and looks up the matching value from df1
df <- left_join(df2, df1, by = c('category' = 'letters')) %>%
  mutate(diff = numbers - values)
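For comparison, the same lookup can be sketched in base R. This assumes every category in df2 has exactly one matching letter in df1; match() then returns, for each row of df2, the position of its letter in df1. The set.seed() call is an addition, there only to make the example reproducible.

```r
set.seed(1)  # assumption: added only so the example is reproducible
df1 <- data.frame(letters = LETTERS[1:5], values = rnorm(5, mean = 50))
df2 <- data.frame(category = sample(LETTERS[1:5], 20, replace = TRUE),
                  numbers = rnorm(20, mean = 100))

# For each category in df2, find the row of df1 that holds that letter
idx <- match(df2$category, df1$letters)

# Subtract the looked-up value row by row
df2$diff <- df2$numbers - df1$values[idx]
head(df2)
```

A category with no match in df1 would give NA in idx, and therefore NA in diff, rather than an error.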
Consider my labelled df1 below.
This is my second data frame, df2.
I want to change the item column in df2 so that if a row contains any of the column names of df1, that string is replaced by the column's label, as in df3 below.
Any approach to achieve this is highly appreciated.
library(Hmisc)
library(dplyr)
df1 <- data.frame(low = rep(1,3),
med = rep(2,3),
high = rep(3,3),
other = rep(0,3))
label(df1$low) <- "is it low"
label(df1$med) <- "is it med"
label(df1$high) <- "is it high"
label(df1$other) <- "is it broken"
df2 <- data.frame(item = c("lowYes", "medNo", "high"),
value = c(12, 10, 14))
df3 <- data.frame(item = c("is it low:Yes", "is it med:No", "is it high"),
value = c(12, 10, 14))
library(stringr)
df2$item <- str_replace(df2$item, grep(df2$item, names(df1)), label(df1)) # not for all rows
Extract the labels from 'df1' and create a named vector (unlist), then pass that named vector to str_replace_all to modify the 'item' column: each name (the key) is matched as a substring of 'item' and replaced by its value (the label).
library(dplyr)
library(stringr)
library(Hmisc)
keyval <- df1 %>%
  summarise(across(everything(), ~ str_c(label(.x), ":"))) %>%
  unlist
df3 <- df2 %>%
  mutate(item = trimws(str_replace_all(item, keyval), whitespace = ":"))
-output
df3
           item value
1 is it low:Yes    12
2  is it med:No    10
3    is it high    14
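To show the mechanism without Hmisc or stringr, here is a minimal base R sketch. The keyval vector is written out by hand (an assumption standing in for the labels extracted above), and the loop applies each name/label pair in turn, which is roughly what str_replace_all does with a named vector.

```r
# Hypothetical lookup: names are the column names to find, values are the labels
keyval <- c(low = "is it low:", med = "is it med:",
            high = "is it high:", other = "is it broken:")
item <- c("lowYes", "medNo", "high")

# Replace each column name with its label, one pattern at a time
for (nm in names(keyval)) {
  item <- gsub(nm, keyval[[nm]], item, fixed = TRUE)
}

# Drop the trailing ":" left behind when nothing followed the name
item <- sub(":$", "", item)
item
```

Because the patterns are applied sequentially, this only behaves well when no label contains another column name as a substring.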
I have a dataframe that has multiple outliers. I suspect that these outliers have produced different results than expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
identify_outliers() expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are considered outliers.
So you just have to identify them and remove them. I don't know how to do it with rstatix, but in base R it can be achieved as in the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) | demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
# remove them from your dataframe
df2 = demo.data[-outliers,]
Or wrap this in a function that returns the indices of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]
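One caveat with negative indexing: if which() finds no outliers, demo.data[-outliers, ] becomes demo.data[integer(0), ] and drops every row. A minimal sketch that avoids this by using a logical mask instead of indices (is_outlier is a hypothetical helper name, not part of any package):

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences
is_outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

set.seed(123)
demo.data <- data.frame(sample = 1:20,
                        score = c(rnorm(19, mean = 5, sd = 2), 50))

# Logical subsetting keeps all rows when nothing is flagged
clean <- demo.data[!is_outlier(demo.data$score), ]
```

This sketch assumes score has no missing values; NAs would need separate handling.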
Goal: To filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows present in both datasets, which has duplicated rows in the result.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try:
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
distinct(word, .keep_all = TRUE)
To further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct rows if that's what you want. Your error might very well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
df2 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()
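The difference between the two answers (distinct on all columns versus distinct on word only) can be sketched in base R, where unique() plays the role of distinct() and !duplicated(word) plays the role of distinct(word, .keep_all = TRUE). The joined data frame and its collocate column below are made up for illustration.

```r
# Hypothetical joined result with one row per word/collocate match
joined <- data.frame(word = c("happy", "happy", "sad", "sad", "calm"),
                     collocate = c("day", "end", "day", "day", "sea"))

# Drops only rows that are identical in every column
full_rows <- unique(joined)

# Keeps the first row for each word, whatever the other columns hold
per_word <- joined[!duplicated(joined$word), ]

nrow(full_rows)  # 4
nrow(per_word)   # 3
```

If the goal is one row per word, deduplicating on every column is not enough whenever the same word matched rows that differ elsewhere.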
I've created a function which I am trying to apply to a dataset using pmap. The function I've created amends some columns in a dataset. I want the amendment that's applied to the two columns to carry over to the 2nd and subsequent iterations of pmap.
Reproducible example below:
library(tidyr)
library(dplyr)
library(purrr) # provides pmap_dfr
set.seed(1982)
#create example dataset
dataset <- tibble(groupvar = sample(c(1:3), 20, replace = TRUE),
a = sample(c(1:10), 20, replace = TRUE),
b = sample(c(1:10), 20, replace = TRUE),
c = sample(c(1:10), 20, replace = TRUE),
d = sample(c(1:10), 20, replace = TRUE)) %>%
arrange(groupvar)
#function to sum 2 columns (col1 and col2), then adjust those columns such that the cumulative sum of the two columns
#within the group doesn't exceed the specified limit
shared_limits <- function(col1, col2, group, limit){
dataset <- dataset
dataset$group <- dataset[[group]]
dataset$newcol <- dataset[[col1]] + dataset[[col2]]
dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum=cumsum(newcol))
dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum>limit, limit, dataset$cumulative_sum)
dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag=lag(limited_cumulative_sum))
dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag),0,dataset$limited_cumulative_sum_lag)
dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
dataset[[col1]] <- ifelse(dataset$adjusted_sum==dataset$newcol, dataset[[col1]],
pmin(dataset[[col1]], dataset$adjusted_sum))
dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
dataset <- dataset %>% ungroup() %>% dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum, -limited_cumulative_sum_lag, -adjusted_sum)
dataset
}
#apply function directly
new_dataset <- shared_limits("a", "b", "groupvar", 25)
#apply function using a separate parameters table and pmap_dfr
shared_limits_table <- tibble(col1 = c("a","b"),
col2 = c("c","d"),
group = "groupvar",
limit = c(25, 30))
dataset <- pmap_dfr(shared_limits_table, shared_limits)
In the example above the pmap function applies the shared limit to columns "a" and "c" and returns an adjusted dataset as the first element in the list. It then applies the shared limit to columns "b" and "d" and returns this as the second element in the list. However the adjustments that have been made to "a" and "c" are now lost.
Is there any way of storing the adjustments that are made to each column as we progress through each iteration of pmap?
You can iteratively apply a function to your dataset with reduce() from purrr.
First, I'd change your function so that it takes the data frame as an argument (df) instead of reading dataset from the global environment:
shared_limits <- function(df, col1, col2, group, limit){
dataset <- df
dataset$group <- dataset[[group]]
dataset$newcol <- dataset[[col1]] + dataset[[col2]]
dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum=cumsum(newcol))
dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum>limit, limit, dataset$cumulative_sum)
dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag=lag(limited_cumulative_sum))
dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag),0,dataset$limited_cumulative_sum_lag)
dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
dataset[[col1]] <- ifelse(dataset$adjusted_sum==dataset$newcol, dataset[[col1]],
pmin(dataset[[col1]], dataset$adjusted_sum))
dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
dataset <- dataset %>% ungroup() %>% dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum, -limited_cumulative_sum_lag, -adjusted_sum)
dataset
}
Then make a list of the arguments you want to pass to the function at each step:
shared_limits_args_list <- list(
list("a", "c", "groupvar", 25),
list("b", "d", "groupvar", 30))
Then call reduce, setting the dataset as your initial x with the .init parameter. At each iteration a sublist of arguments from shared_limits_args_list will be passed to the function as y. [[ is used to select the list elements for each position. The output dataframe from the function will become the new x for the next iteration, and the next sublist of shared_limits_args_list will be the next set of arguments. When all of the sublists of shared_limits_args_list have been used, the final dataframe is output.
library(purrr) # provides reduce
dataset_combined <-
  reduce(shared_limits_args_list,
         function(x, y) shared_limits(df = x, y[[1]], y[[2]], y[[3]], y[[4]]),
         .init = dataset)
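To make the carry-over behaviour easier to see, here is a toy sketch with base R's Reduce(), which folds in the same way. cap_column is a hypothetical stand-in for shared_limits, and the data are made up; the point is only that the output of one step is the input of the next.

```r
# Toy step function (hypothetical): cap one column at a limit
cap_column <- function(df, col, limit) {
  df[[col]] <- pmin(df[[col]], limit)
  df
}

toy <- data.frame(a = c(1, 5, 9), b = c(2, 6, 10))
steps <- list(list("a", 4), list("b", 7))

# x is the data frame so far, y is the current sublist of arguments;
# toy is the starting value, like .init in purrr::reduce
result <- Reduce(function(x, y) cap_column(x, y[[1]], y[[2]]), steps, toy)
result
```

After both steps the cap on a is still in place when b is capped, which is the behaviour that pmap_dfr could not provide because each of its iterations starts from the original data.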
I have multiple data frames, each an individual sequence with the same columns. I need to delete all the rows that come after a negative value is encountered in the column "OnsetTime": the row with the negative value itself should stay, and everything after it should go. All sequences have 16 rows in total.
I think this must be possible with a loop, but I have no experience with loops in R, and I have 499 data frames. Currently I am deleting the rows one sequence at a time, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here (example sequence).
Regarding this example, I want to delete rows 7 to 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three dataframes based on random numbers
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for (i in seq_along(list_df)) {
  min_row <-
    list_df[[i]] %>%
    mutate(id = row_number()) %>%  # add a column with the row number
    filter(value < 0) %>%          # keep the rows with negative values
    summarise(min(id)) %>%         # row number of the first negative value
    as.numeric()                   # convert the 1x1 data frame to a scalar
  # keep rows 1 to min_row (assumes each data frame has at least one negative value)
  list_df[[i]] <- list_df[[i]] %>% slice(1:min_row)
}
Hope it helps!
We can get the datasets into a list, assuming that the object names start with 'sequence' followed by a _ and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition.
lst1 <- lapply(mget(ls(pattern="^sequence_\\d+$")), function(x) {
i1 <- Reduce(`|`, lapply(x, `<`, 0))
#or use rowSums
#i1 <- rowSums(x < 0) > 0
i2 <- which(i1)[1]
x[seq(i2),]
}
)
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 *5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
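Both answers assume every data frame contains at least one negative value. A minimal base R sketch that looks only at the "OnsetTime" column and leaves a data frame untouched when it has no negative value (keep_through_first_negative is a hypothetical helper name, and the two sequences are made up):

```r
# Hypothetical helper: keep rows up to and including the first negative value in `col`
keep_through_first_negative <- function(df, col = "OnsetTime") {
  first_neg <- which(df[[col]] < 0)[1]
  if (is.na(first_neg)) return(df)  # no negative value: keep everything
  df[seq_len(first_neg), , drop = FALSE]
}

# Made-up sequences for illustration
seqs <- list(
  sequence_a = data.frame(OnsetTime = c(10, 20, -1, 30, 40)),
  sequence_b = data.frame(OnsetTime = c(5, 15, 25))
)

trimmed <- lapply(seqs, keep_through_first_negative)
sapply(trimmed, nrow)
```

With the real data, seqs could be built with mget(ls(pattern = "^sequence_\\d+$")) as in the answer above.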