Processing multiple columns and dynamically naming new columns in R

Variables are mistakenly being entered into multiple columns (e.g. "aaa_1", "aaa_2" and "aaa_3", or "ccc_1", "ccc_2" and "ccc_3"). I need to create a single new column for each (e.g. "aaa" or "ccc"). Some variables are currently in a single column ("hhh_1"), but more columns may be added ("hhh_2" etc.).
This is what I got:
aaa_1 <- c(43, 23, 65, NA, 45)
aaa_2 <- c(NA, NA, NA, NA, NA)
aaa_3 <- c(NA, NA, 92, NA, 82)
ccc_1 <- c("fra", NA, "spa", NA, NA)
ccc_2 <- c(NA, NA, NA, "wez", NA)
ccc_3 <- c(NA, "ija", NA, "fda", NA)
ccc_4 <- c(NA, NA, NA, NA, NA)
hhh_1 <- c(183, NA, 198, NA, 182)
dataf1 <- data.frame(aaa_1,aaa_2,aaa_3,ccc_1,ccc_2, ccc_3,ccc_4,hhh_1)
This is what I want:
aaa <- c(43, 23, NA, NA, NA)
ccc <- c("fra", "ija", "spa", NA, NA)
hhh <- c(183, NA, 198, NA, 182)
dataf2 <- data.frame(aaa,ccc,hhh)
A general solution is needed, as there are ~100 variables (e.g. "aaa", "ccc", "hhh", "ttt", "eee", etc.).
Thanks!

This is a base solution, i.e. no packages.
First define get_only: given a list, it converts it to a data frame and applies get_only to each row; given a vector, it returns the single non-NA value in it, or NA if there is not exactly one.
Define root to be the column names without the suffixes.
Convert the data frame to a list of columns, group them by root and apply get_only to each such group.
Finally, convert the resulting list to a data frame.
get_only <- function(x) UseMethod("get_only")
get_only.list <- function(x) apply(data.frame(x), 1, get_only)
get_only.default <- function(x) if (sum(!is.na(x)) == 1) na.omit(x) else NA
root <- sub("_.*", "", names(dataf1))
as.data.frame(lapply(split(as.list(dataf1), root), FUN = get_only))
giving:
aaa ccc hhh
1 43 fra 183
2 23 ija NA
3 NA spa 198
4 NA <NA> NA
5 NA <NA> 182
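The "exactly one non-NA" rule can be sanity-checked by calling get_only directly on vectors (the definitions are repeated here so the snippet stands alone):

```r
# same definitions as in the answer above
get_only <- function(x) UseMethod("get_only")
get_only.default <- function(x) if (sum(!is.na(x)) == 1) na.omit(x) else NA

get_only(c(43, NA, NA))  # exactly one non-NA: returns 43 (plus an na.action attribute)
get_only(c(43, NA, 92))  # two non-NA values: returns NA
get_only(c(NA, NA, NA))  # no values at all: returns NA
```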

We may try merged.stack from splitstackshape:
library(splitstackshape)
nm1 <- sub("_\\d+", "", names(dataf1))
tbl <- table(nm1) > 1
merged.stack(dataf1, var.stubs = names(tbl)[tbl], sep="_")
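For the question's column names, the stub computation picks out only the prefixes that span more than one column (a quick base R check, independent of splitstackshape):

```r
# column names from the question's dataf1
nm1 <- sub("_\\d+", "", c("aaa_1", "aaa_2", "aaa_3",
                          "ccc_1", "ccc_2", "ccc_3", "ccc_4", "hhh_1"))
tbl <- table(nm1) > 1
names(tbl)[tbl]  # "aaa" "ccc" -- hhh has a single column, so it is not stacked
```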

I'm not sure your example is right. For example, in the third row you've got values for both aaa_1 and aaa_3, yet the desired output has NA for that row.
If I've understood what you're trying to do, though, it will be much easier to transpose columns to rows, fix them, and then transpose back again. Try this as a starting point using the 'tidyverse' packages dplyr and tidyr.
library(tidyverse)
library(stringr)
aaa_1 <- c(43, 23, 65, NA, 45)
aaa_2 <- c(NA, NA, NA, NA, NA)
aaa_3 <- c(NA, NA, 92, NA, 82)
ccc_1 <- c("fra", NA, "spa", NA, NA)
ccc_2 <- c(NA, NA, NA, "wez", NA)
ccc_3 <- c(NA, "ija", NA, "fda", NA)
ccc_4 <- c(NA, NA, NA, NA, NA)
hhh_1 <- c(183, NA, 198, NA, 182)
dataf1 <- data.frame(aaa_1, aaa_2, aaa_3, ccc_1, ccc_2, ccc_3, ccc_4, hhh_1)
data <- dataf1 %>%
  mutate(row_num = row_number()) %>%           # create a row number to track values
  gather(key, value, -row_num) %>%             # flatten your data
  drop_na() %>%                                # drop NA rows
  mutate(key = str_replace(key, "_.", "")) %>% # remove the '_x' part of names
  group_by(row_num) %>%
  top_n(1) %>%
  spread(key, value)                           # pivot back to columns
For your example you need the group_by() and top_n() lines to make it run, because some rows have multiple values. If each row only has one value (as I think it should?) then you can remove these two lines. It will actually be better without them, because then the code will refuse to run if your data is wrong.
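For reference, the simplified pipeline described above would look like this (assuming dataf1 as defined earlier; note that spread() will stop with an error if any row/key pair still has duplicate values, which is exactly the safety check being described):

```r
library(tidyverse)
library(stringr)

data <- dataf1 %>%
  mutate(row_num = row_number()) %>%           # create a row number to track values
  gather(key, value, -row_num) %>%             # flatten the data
  drop_na() %>%                                # drop NA rows
  mutate(key = str_replace(key, "_.", "")) %>% # remove the '_x' part of names
  spread(key, value)                           # errors if a row/key pair is duplicated
```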
Edit following comment below. This will make any duplicated entries NA.
data <- dataf1 %>%
  mutate(row_num = row_number()) %>%               # create a row number to track values
  gather(key, value, -row_num) %>%                 # flatten your data
  drop_na() %>%                                    # drop NA rows
  mutate(key = str_replace(key, "_.", "")) %>%     # remove the '_x' part of names
  group_by(row_num, key) %>%
  mutate(count = n()) %>%                          # count entries for each row/key combo
  mutate(value = ifelse(count > 1, NA, value)) %>% # set NA for rows with duplicates
  drop_na() %>%
  spread(key, value) %>%                           # pivot back to columns
  select(-count)                                   # drop the `count` variable


dplyr not filtering dates correctly

I have a data frame; in dput form it looks like this:
test_df <- structure(list(dob = structure(c(-25932, -25932, -25932, -25932,
-25932, -25932, -25932, -25932, -25932, -25932, -25932, -25932,
-25932, -25932, -16955, -13514, -12968, -12419, -12237, -11537,
-10168, -9742, -9376, -9131, -8766, -8676, -8462, -8189, -8036,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = "Date")), class = "data.frame", row.names = c(NA,
-45L))
This produces a data frame with 45 rows.
If I then run:
test_include <- test_df %>% filter(dob == '1899-01-01')
This returns the proper number of rows, which is 14.
But if I do the opposite and filter for all rows that DO NOT equal '1899-01-01', it returns a weird result:
test_exclude <- test_df %>% filter(dob != '1899-01-01')
Instead of returning 31 rows (45 - 14), it returns 15 rows, which makes no sense.
Does anyone have a solution and explanation as to why it is doing this?
Basically, != does not catch NA values: comparing NA to anything yields NA, and filter() drops rows where the condition is NA. Check this post for more information, but here is an example with your data:
library(dplyr)
library(lubridate)
> test_df %>% filter(dob == ymd('1899-01-01')) %>% nrow()
[1] 14
> test_df %>% filter(dob != ymd('1899-01-01')) %>% nrow()
[1] 15
> test_df %>% filter(is.na(dob)) %>% nrow()
[1] 16
> test_df %>% filter(dob != ymd('1899-01-01') | is.na(dob)) %>% nrow()
[1] 31
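An alternative worth knowing (my addition, not part of the answer above) is %in%, which returns FALSE rather than NA for missing values, so the negation keeps the NA rows:

```r
library(dplyr)

# using test_df from the question
test_df %>%
  filter(!dob %in% as.Date("1899-01-01")) %>%
  nrow()
# 31: the 15 other dates plus the 16 NA rows
```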

Calculate the mean of a column and concatenate the comments in the next column

I want to calculate the mean of a column and also concatenate the texts from a second column into the output.
For example, below I want to calculate the mean of C1 and then concatenate all the texts in C1T into the next column if there is more than one text in C1T.
df <- data.frame(A1 = c("class","type","class","type","class","class","class","class","class"),
B1 = c("b2","b3","b3","b1","b3","b3","b3","b2","b1"),
C1=c(6, NA, 1, 6, NA, 1, 6, 6, 2),
C1T=c(NA, "Part of other business", NA, NA, NA, NA, NA, NA, NA),
C2=c(NA, 4, 1, 2, 4, 4, 3, 3, NA),
C2T=c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
C3=c(3, 4, 3, 3, 6, NA, 2, 4, 1),
C3T=c(NA, NA, NA, NA, "two part are available but not in source", NA, NA, NA, NA),
C4=c(5, 5, 2, NA, NA, 6, 4, 1, 2),
C4T=c(NA, NA, NA, NA, NA, NA, NA, "Critical Expert", NA),
C5=c(6, 2, 6, 4, 2, 2, 5, 4, 1),
C5T=c(NA, NA, NA, NA, NA, "most of things are stuck", "weather responsible", NA, NA))
var <- "C1"
var1 <- "C1T"
var <- rlang::parse_expr(var)
var1 <- rlang::parse_expr(var1)
df1 <- df%>%filter(A1 == "class")
T1<- df1 %>%group_by(B1)%>%summarise(mean=round(mean(!!var,na.rm = TRUE),1))
Comments <- df1 %>% group_by(B1) %>% summarise_at(vars(var1), paste0, collapse = " ") %>%
select(var1) %>% unlist() %>% gsub("NA","",.) %>% stringi::stri_trim_both()
cbind(T1,Comments)
Edited Answer:
var <- "C1"
var1 <- "C1T"
filtercol <- "A1"
filterval <- "class"
groupingvar <- "B1"
var <- rlang::parse_expr(var)
var1 <- rlang::parse_expr(var1)
filtercol <- rlang::parse_expr(filtercol)
groupingvar <- rlang::parse_expr(groupingvar)
library(dplyr)
df1 <- df %>% filter(!!filtercol == filterval)
T1 <- df1 %>% group_by(!!groupingvar) %>% summarise(mean=round(mean(as.numeric(!!var),na.rm = TRUE),1))
Comments <- df1 %>% select(!!groupingvar, !!var1) %>%
group_by(!!groupingvar) %>%
summarise_at(vars(!!var1), paste0, collapse = " ") %>%
select(!!var1) %>% unlist() %>% gsub("NA", "", .) %>%
stringi::stri_trim_both()
T1 <- cbind(T1,Comments)
Update on OP's request (see comments):
library(dplyr)
library(tidyr)
# helper function to coalesce by column
coalesce_by_column <- function(df) {
return(coalesce(df[1], df[2]))
}
df %>%
  pivot_longer(
    cols = contains("T"),
    names_to = "names",
    values_to = "values"
  ) %>%
  filter(names == "C1T") %>%
  group_by(names) %>%
  summarise(Mean = mean(c_across(C1:C5 & where(is.numeric)), na.rm = TRUE),
            Comments = coalesce_by_column(values))
Output:
names Mean Comments
<chr> <dbl> <chr>
1 C1T 3.47 Part of other business
First answer:
coalesce() to construct the Comments column.
rowwise() with c_across() to calculate the mean row by row.
In case you need to group, you can use `group_by()`.
library(dplyr)
df %>%
  mutate(Comments = coalesce(C1T, C2T, C3T, C4T, C5T), .keep = "unused") %>%
  rowwise() %>%
  mutate(Mean = mean(c_across(C1:C5 & where(is.numeric)), na.rm = TRUE)) %>%
  select(A1, B1, Mean, Comments)
Output:
A1 B1 Mean Comments
<chr> <chr> <dbl> <chr>
1 class b2 5 NA
2 type b3 3.75 Part of other business
3 class b3 2.6 NA
4 type b1 3.75 NA
5 class b3 4 two part are available but not in source
6 class b3 3.25 most of things are stuck
7 class b3 4 weather responsible
8 class b2 3.6 Critical Expert
9 class b1 1.5 NA

Ifelse conditional on same strings in multiple columns

So I guess this is possible to achieve by writing a very long line of code using mutate() and ifelse(), but I want to know if there is a way of doing it without writing a ton of code.
I have data where the degree of each individual is written in a non-ordered fashion. The data looks like this:
id <- c(1, 2, 3, 4, 5, 6)
degree1 <- c("masters", "bachelors", "PhD", "bachelors", "bachelors", NA)
degree2 <- c("PhD", "masters", "bachelors", NA, NA, NA)
degree3 <- c("bachelors", NA, "masters", NA, "masters", NA)
Now I want to create a new column containing the string for the highest degree, like this
dat$highest_degree <- c("PhD", "masters", "PhD", "bachelors", "masters", NA)
How can I achieve this?
An option is to loop over the rows of the selected 'degree' columns, convert to a factor with the levels specified in order, drop the unused levels, and select the first remaining level:
v1 <- c("PhD", "masters", "bachelors")
dat$highest_degree <- apply(dat[-1], 1, function(x)
levels(droplevels(factor(x, levels = v1)))[1])
dat$highest_degree
#[1] "PhD" "masters" "PhD" "bachelors" "masters" NA
Or, using tidyverse: reshape into 'long' format, arrange within each 'id' by matching against the ordered degree vector, slice the first row per group, then join back with the original data:
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(cols = starts_with('degree'), values_to = 'highest_degree') %>%
  select(-name) %>%
  arrange(id, match(highest_degree, v1)) %>%
  group_by(id) %>%
  slice_head(n = 1) %>%
  ungroup %>%
  left_join(dat, .)
data
dat <- data.frame(id, degree1, degree2, degree3)
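The arrange() step above works because match() maps each degree to its position in v1, so lower numbers mean higher degrees (a small illustration):

```r
v1 <- c("PhD", "masters", "bachelors")
match(c("bachelors", "PhD", "masters"), v1)
# 3 1 2 -- sorting by these ranks puts "PhD" first within each id
```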
Here is a base R option using pmin + factor:
lvs <- c("PhD", "masters", "bachelors")
dat$highest_degree <- lvs[
  do.call(
    pmin,
    c(asplit(matrix(as.integer(factor(as.matrix(dat[-1]), levels = lvs)), nrow(dat)), 2),
      na.rm = TRUE)
  )
]
which gives
> dat
id degree1 degree2 degree3 highest_degree
1 1 masters PhD bachelors PhD
2 2 bachelors masters <NA> masters
3 3 PhD bachelors masters PhD
4 4 bachelors <NA> <NA> bachelors
5 5 bachelors <NA> masters masters
6 6 <NA> <NA> <NA> <NA>
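The pmin trick relies on the factor codes following the order of lvs, so code 1 ("PhD") is the 'smallest' and wins the row-wise pmin (a minimal illustration):

```r
lvs <- c("PhD", "masters", "bachelors")
as.integer(factor(c("masters", "PhD"), levels = lvs))
# 2 1 -- pmin over these codes picks 1, which indexes back into lvs as "PhD"
```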
Data
> dput(dat)
structure(list(id = c(1, 2, 3, 4, 5, 6), degree1 = c("masters",
"bachelors", "PhD", "bachelors", "bachelors", NA), degree2 = c("PhD",
"masters", "bachelors", NA, NA, NA), degree3 = c("bachelors",
NA, "masters", NA, "masters", NA)), class = "data.frame", row.names = c(NA,
-6L))

Create multiple dataframes

I have a dataframe (df) that looks like below:
Objective: I want to create 52 dataframes, but I don't know how to do it with dplyr.
Assuming your dataframe is in variable df, try the following code:
library(dplyr)
columns_name = names(df) #names of column in your dataframe
df_list =list() #empty list to store output dataframes
#loop through columns of the original dataframe,
#selecting the first and i_th column and storing the resulting dataframe in a list
for (i in 1:(length(columns_name) - 1)) {
  df_list[[i]] = df %>%
    select(columns_name[1], columns_name[i + 1]) %>%
    filter_all(all_vars(!is.na(.)))
}
#access smaller dataframes using the following code
df_list[[1]]
df_list[[2]]
Try this code:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  pivot_longer(-1) %>%
  group_by(name) %>%
  filter(!is.na(value))
#List
List <- split(new,new$name)
#Set to envir
list2env(List,envir = .GlobalEnv)
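Before (or instead of) pushing everything into the global environment with list2env(), the split list can be inspected directly; a toy example with made-up data standing in for the long data frame built above:

```r
# toy stand-in for the 'new' long data frame built above
new <- data.frame(name  = c("P101COD", "P101COD", "P102COD"),
                  value = c(411010106, 411010106, 421010102))
List <- split(new, new$name)
names(List)        # "P101COD" "P102COD"
List[["P101COD"]]  # only the rows for that variable
```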
Some data used:
#Data
df <- structure(list(id_unico = c("112172-1", "112195-1", "112257-1",
"112268-1", "112383-1", "112452-1", "112715-1", "112716-1", "112761-1",
"112989-1"), P101COD = c(NA, NA, NA, NA, NA, 411010106L, NA,
NA, 411010106L, NA), P102COD = c(421010102L, 421010102L, 421010102L,
421010102L, 421010102L, NA, 421010108L, 421010108L, NA, 421010102L
), P103COD = c(441010109L, 441010109L, 441010109L, 441010109L,
441010109L, 441010109L, 441010109L, 441010109L, 441010109L, 441010101L
), P110_52_COD = c(NA, 831020103L, 831020103L, NA, 831020103L,
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))

How can I estimate a function in a group?

I have a data frame with 1530 obs of 6 variables. In this data frame there are 51 assets with 30 obs each. I tried to apply the MACD function to obtain two values, macd and signal, but an error shows up. This is an example:
macdusdt <- filtusdt %>% group_by(symbol) %>% do(tail(., n = 30))
macd1m <- macdusdt %>%
  mutate(signals = MACD(macdusdt$lastPrice,
                        nFast = 12, nSlow = 26, nSig = 9, maType = "EMA", percent = T))
Error: Column signals must be length 30 (the group size) or one, not 3060
I want to apply the MACD function to every asset in the data frame. The database is here: https://www.dropbox.com/s/ww8stgsspqi8tef/macdusdt.xlsx?dl=0
Based on the data provided, the code gives an error when applied:
Error in EMA(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, :
n > number of non-NA values in column(s) 1
To prevent that, we can do:
library(dplyr)
library(TTR)
filtusdt %>%
  group_by(symbol) %>%
  slice(tail(row_number(), 30)) %>%
  # EMA with nSlow = 26 needs at least 26 non-NA prices in the group
  mutate(signals = if (sum(!is.na(lastPrice)) >= 26)
    MACD(lastPrice, nFast = 12, nSlow = 26, nSig = 9,
         maType = "EMA", percent = TRUE) else NA)
The remaining NA results could be an artifact of the subset of the dataset provided.
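Part of the original error is also explained by the shape of the return value: TTR::MACD() returns a two-column object (macd and signal), so assigning it from the full ungrouped column produced 1530 x 2 = 3060 values, hence "not 3060". A quick check on illustrative data (not the OP's):

```r
library(TTR)

set.seed(1)
price <- cumsum(rnorm(100, sd = 0.5)) + 100  # made-up price series
m <- MACD(price, nFast = 12, nSlow = 26, nSig = 9, maType = "EMA", percent = TRUE)
dim(m)       # 100 rows, 2 columns
colnames(m)  # "macd" "signal"
```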
