Mutate multiple new columns based on references to themselves - r

A data frame:
library(lubridate)  # provides ymd()
df <- data.frame(
  date = seq(ymd('2021-01-01'), ymd('2021-01-31'), by = 1),
  ims_x = rnorm(31, mean = 0),
  ims_y = rnorm(31, mean = 1),
  ims_z = rnorm(31, mean = 2),
  blah = 1:31
)
I'd like to mutate 3 new fields (not overwrite), 'ims_x_lagged', 'ims_y_lagged' and 'ims_z_lagged', where each new field corresponds to the original but lagged by one day/row. The names of the new fields would just have '_lagged' appended to the name of the original, and each value would be that of its original in the preceding row.
I could do this manually for each field, but that would be a lot of typing and my real data has many more than 3 fields that need to be lagged.
Something kind of like this, if it's possible to tell what I'm trying to do:
df <- df %>%
mutate_at(vars(contains('ims_')) := lag(vars(contains('ims_')))) # but append '_lagged' to the name

In dplyr 1.0.0 and later, the scoped verbs (mutate_at, mutate_all and friends) are superseded by the more flexible across(). If .names is not specified, across() overwrites the selected columns in place. When .names is supplied, {.col} stands for the original column name, so a prefix or suffix can be added around it as a string.
library(dplyr)
df <- df %>%
mutate(across(starts_with('ims'), lag, .names = "{.col}_lagged"))
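To make the effect of .names concrete, here is a minimal sketch on a tiny deterministic data frame. The names df_small and lagged are made up for illustration; the across() call is the same as in the answer, with lag written out as dplyr::lag so it cannot be confused with stats::lag.

```r
library(dplyr)

# Hypothetical toy data: two 'ims_' columns plus one column that must stay untouched
df_small <- data.frame(ims_x = c(1, 2, 3), ims_y = c(10, 20, 30), blah = 1:3)

lagged <- df_small %>%
  mutate(across(starts_with("ims_"), ~ dplyr::lag(.x), .names = "{.col}_lagged"))

names(lagged)        # originals are kept, lagged copies are appended at the end
lagged$ims_x_lagged  # first value is NA because row 1 has no preceding row
```

The column selection is evaluated against the columns that exist before across() runs, so the freshly created '_lagged' columns are not lagged a second time.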

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
You haven't included example data, so I've made a quick and simple example to demonstrate my answer. I think the logic is much more straightforward if you first combine the list of tibbles into a single tibble. That lets you do everything in one dplyr pipe, with outliers flagged by 1 in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
                  colB = seq(0.1,2.1,0.1),
                  id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
                  colB = seq(21,41,1),
                  id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
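The question also asks to replace each outlier with the mean of the values at the same position. A minimal sketch of that step follows, assuming the same bind_rows/group_by approach. The toy list of ten identical tibbles with one planted extreme value is invented for illustration, and the mean used for replacement includes the outlier itself, which is how the question words it.

```r
library(dplyr)

# Hypothetical data: ten small tibbles sharing an 'id' index column
tibbleList <- lapply(1:10, function(i) tibble(id = 1:3, colA = c(1, 2, 3)))
tibbleList[[4]]$colA[2] <- 100  # plant one extreme value in tibble 4, row 2

cleaned <- bind_rows(tibbleList, .id = "tbl") %>%
  group_by(id) %>%                        # compare values at the same position
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         outlier = colA > meanA + 2 * sdA | colA < meanA - 2 * sdA,
         colA = ifelse(outlier, meanA, colA)) %>%  # replace outliers, keep the rest
  ungroup()

# Back to a list of tibbles, one per original element
cleanedList <- split(cleaned, cleaned$tbl)
```

Note that with only two tibbles, as in the example above, no value can lie more than 2 standard deviations from the mean of its pair, so the rule only starts flagging values once there are more tibbles in the list.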

How to summarize rows of a data frame into one while removing Duplicates in R?

I have a data frame with 2 or more rows and different columns (ID, Location, Task, Skill, ...). I want to summarize these rows into a single row (data frame) in which the entries of each column are joined together, but only if they differ. That is, if two rows have the same ID, the final row should show it only once (i.e. "ID1"), but if the IDs differ, both should be shown (i.e. "ID1, ID2"). Some numerical values should be added (+) together.
df = data.frame("ID" = c("PA1", "PA1"),
                "Occupation" = c("PO - react to DCS, initiate corrective measures, react to changes", "PO - data based operations"),
                "Field" = c("PA", "PA"), "Work" = c(0.5, 0.1), "Skill1" = c("CRO", "CRO"),
                "Skill2" = c("0", "PPto"), "ds" = c(5, 5))
print(df)
and the output should look like this
df_final = data.frame("ID" = c("PA1"), "Occupation" = c("PO - react to DCS, initiate corrective measures, react to changes, data based operations"), "Field" = c("PA"), "Work" = c(0.6), "Skill1" = c("CRO"), "Skill2" = c("PPto"), "ds" = c(5))
print(df_final)
Thank you!
Let's ignore Skill2 for now:
How close is the following code to what you want to do?
df2 %>%
  group_by(ID) %>%
  summarise(work = sum(Work),
            skill1 = unique(Skill1),
            ds = unique(ds),
            occupation = paste0(Occupation, collapse = " "),
            field = unique(Field))
You can also mutate(occupation = str_replace_all(occupation, "PO - ", "")) (from stringr) to get rid of the repeated "PO - " prefixes.
You're going to run into problems if the variables like Skill1/Skill2/ds are not unique to each ID, as in they have cardinality > 1.
df2 %>%
  group_by(ID) %>%
  summarise(work = sum(Work),
            skill1 = unique(Skill1),
            skill2 = unique(Skill2),
            ds = unique(ds),
            occupation = paste0(Occupation, collapse = " "),
            field = unique(Field))
If it's a simple data-entry issue, you could do a bit of wrangling to filter for only Skill2 entries with letters contained, and then join this frame back to your original frame.
You could also use the paste0() collapse = trick, but then you'll end up with Skill2 = c(NA, "PPto"), which I'm pretty sure you don't want.
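One way to join only the distinct entries, as the question asks, is to collapse unique(x) instead of x. The sketch below is an illustration, not the answerer's code: collapse_unique is a made-up helper, the Occupation strings are shortened, the missing Skill2 entry is assumed to be coded as NA rather than 0, and ds is assumed to be constant across the rows.

```r
library(dplyr)

# Hypothetical, shortened version of the question's data (missing Skill2 coded as NA)
df <- data.frame(
  ID = c("PA1", "PA1"),
  Occupation = c("PO - react to DCS", "PO - data based operations"),
  Field = c("PA", "PA"),
  Work = c(0.5, 0.1),
  Skill1 = c("CRO", "CRO"),
  Skill2 = c(NA, "PPto"),
  ds = c(5, 5)
)

# Helper (hypothetical name): drop NAs, keep each distinct value once, join with ", "
collapse_unique <- function(x) paste(unique(x[!is.na(x)]), collapse = ", ")

df_final <- df %>%
  summarise(across(c(ID, Occupation, Field, Skill1, Skill2), collapse_unique),
            Work = sum(Work),   # numeric column that should be added up
            ds = first(ds))     # assumes ds is the same in every row
```

Because unique() runs before paste(), an ID that appears twice is shown once, while two different IDs would come out as "ID1, ID2".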

How to rename and add "_cntr" to the middle of the columns of a dataframe within a function

I am new to programming in R and I have had to make a function that collects a dataframe and returns that same dataframe with twice as many columns as the original and in those new columns, the values have to be the original value minus the mean (the mean is row 51 of the dataframe). The fact is that I have made the function and it works, the only thing I need to do is rename column 9:16 of the dataframe, they have to have the same name as the original columns and add "_cntr" to them.
I had thought to add the _cntr with the paste function, but it does not work for me or I am not using it well, I had thought something like this:
nom = paste("cntr",sep = '_')
colnames(state.df3) = nom
I put this inside the function that I will share next, but it changes the name of the first column to cntr and leaves the rest of the column names as NA.
If I do that:
nom = paste("cntr",9:16,sep = '_')
colnames(state.df3) = nom
It returns cntr_9, cntr_10, cntr_11 ... and I don't want that. I want it to return "Population_cntr", "Income_cntr", "Illiteracy_cntr" ... for columns 9 to 16 (since that is where the duplicates start).
The dataframe that I am using as a test can be accessed here:
state.df = as.data.frame(state.x77)
And this is the function that I have done so far, I would only need to modify the names of the 9:16 columns.
mi_funcion <- function(df) {
  row_medias <- tail(df, 1)
  row_resto <- head(df, -1)
  tmp <- rbind(row_resto - as.list(row_medias), row_medias)
  resultado = cbind(df, tmp)
  return(resultado)
}
If someone could give me a hand and tell me where I am failing I would be very grateful.
This is just an example, so for your data replace 1:2 with 9:16.
df <- data.frame(Population = c(10), Income = c(20000), Illiteracy = ("Y"))
df
  Population Income Illiteracy
1         10  20000          Y
colnames(df)[1:2] <- paste(colnames(df)[1:2], "cntr", sep = "_")
Output:
df
  Population_cntr Income_cntr Illiteracy
1              10       20000          Y
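We can also use dplyr's rename_with. The call below renames every column; .cols accepts a tidy-select expression, so a narrower selection such as 9:16 can be passed instead: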
library(dplyr)
state.df3 %>% rename_with(~paste(.x, "cntr",sep = '_'), .cols = everything())
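Putting the base-R approach into the question's function could look like the sketch below. It renames the centred copy before binding it to the original, which avoids having to address columns 9:16 by position. Appending the column means as row 51 with rbind() is an assumption made here so that the example runs; the question only says that row 51 holds the mean.

```r
state.df <- as.data.frame(state.x77)
# Assumption: the column means are stored as the last row (row 51)
state.df <- rbind(state.df, colMeans(state.df))

mi_funcion <- function(df) {
  row_medias <- tail(df, 1)    # last row: the means
  row_resto <- head(df, -1)    # all other rows
  tmp <- rbind(row_resto - as.list(row_medias), row_medias)
  # Rename the centred copy before binding: original name plus "_cntr"
  colnames(tmp) <- paste(colnames(df), "cntr", sep = "_")
  cbind(df, tmp)
}

state.df3 <- mi_funcion(state.df)
colnames(state.df3)[9:16]
```

The original paste("cntr", sep = '_') call produced a single string because it never received the existing column names; passing colnames(df) as the first argument is what makes paste() return one new name per column.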

How to create a new column based on partial string of another column

I have a data frame with a vector of thousands of project codes, each representing a different type of research. Here's an example:
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))
The first letter of the assignment code denotes the type of research. C = Cartography, B = Biology, G = Geology, and LOG = Logistics.
I would like to create a new column that looks at the first letter of the Assignment column and uses it to denote the type of research it is.
I've tried something similar to this thread, but I know I'm missing something:
R - Creating New Column Based off of a Partial String
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))
Types <- data.frame(Type = c("Cartography", "Biology", "Geology","Logistic"),
stringsAsFactors = FALSE)
Data %>%
mutate(Type = str_match(Assignment, Types$Type)[1,])
You can add a new column Code in your Types data.frame and then join it with original table. You will need to create a Code column in your Data data.frame too.
library(dplyr)
library(stringr)
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))
Types <- data.frame(Type = c("Cartography", "Biology", "Geology","Logistic"),
Code = c("C","B","G","L"), # Create new column here
stringsAsFactors = FALSE)
Data <- Data %>% mutate(Code = substr(Assignment,1L,1L)) # extract first character
Data <- left_join(Data, Types, by = "Code") %>% select(Assignment, Type) # combine
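For a small, fixed set of codes, a named vector can serve as the lookup table instead of a join. This is an alternative sketch in base R, not part of the answer above; type_lookup is a made-up name, and the mapping of "L" to "Logistics" follows the description in the question.

```r
Data <- data.frame(Assignment = c("C-209", "B-543", "G-01", "LOG"))

# Hypothetical lookup: first letter of the code -> type of research
type_lookup <- c(C = "Cartography", B = "Biology", G = "Geology", L = "Logistics")

# Index the lookup by the first character of each assignment code
Data$Type <- unname(type_lookup[substr(Data$Assignment, 1, 1)])
Data
```

Codes whose first letter is missing from the lookup come back as NA, which makes unmapped project codes easy to spot.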

r aggregate and collapse several cells into one

I have a data frame:
x <- data.frame(id = 1:18,
super = c(rep("A", 12), rep("B", 6)),
category = c(rep("one", 6), rep("two", 6), rep("three", 6)),
root = sort(rep(letters[1:6], 3)),
coldefs = letters[1:18], stringsAsFactors = F)
x
I am creating a new column by concatenating 3 columns:
myvars <- c("super", "category", "root")
library(tidyverse)
x <- x %>% unite(col = concat, all_of(myvars), sep = "_", remove = F)  # all_of() for a character vector of names
x
Now, for each unique value of column 'concat' the values of column 'super' are the same, the values of column 'category' are the same, and the values of column "root" are the same. However, for each unique value of column 'concat' the values of column 'id' are different. The same is true for column 'coldefs'.
I would like to collapse (aggregate) x so that it has only as many rows as there are unique values in column 'concat' (i.e., 6 rows). In each row, I want one value from column 'super', one value from column 'category', one value from column 'root'; and then 3 values of column 'id' (concatenated like this: 1;2;3) and 3 values of column 'coldefs' (concatenated like this: a;b;c).
What's the best way of doing it?
I am trying the following, but it's not working:
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"),
super = unique(super), category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))
I am clearly doing something wrong.
Thanks a lot for your help!
I must say this is a bit (or completely) crazy! I tried my code (the one at the bottom) piece by piece and it worked. I merged it all together, and it worked. I don't understand why I was getting an error before. Here is the code that works (at least now):
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"), super = unique(super),
category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))
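A variant that avoids the helper column and the repeated unique() calls is to group by the three key columns directly and collapse the remaining ones with across(). This is a sketch of an alternative, not the code from the post; collapsed is a made-up name.

```r
library(dplyr)

x <- data.frame(id = 1:18,
                super = c(rep("A", 12), rep("B", 6)),
                category = c(rep("one", 6), rep("two", 6), rep("three", 6)),
                root = sort(rep(letters[1:6], 3)),
                coldefs = letters[1:18], stringsAsFactors = FALSE)

collapsed <- x %>%
  group_by(super, category, root) %>%   # one output row per combination
  summarise(across(c(id, coldefs), ~ paste(.x, collapse = ";")),
            .groups = "drop")

collapsed
```

Grouping columns are carried into the result automatically, so super, category and root each appear once per row without any extra code.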
