I am a beginner in R and I am trying to solve a problem that is, I guess, quite easy for experienced users.
The problem is the following: customers (A, B, C) come in repeatedly and use different programs (Prg). I would like to identify "typical sequences" of programs. To do so, I identify the first program they consume, the second, and the third. In the next step, I would like to combine this information into sequences of programs by customer. For a customer first consuming Prg1, then Prg2, then Prg3, the final outcome should be "Prg1-Prg2-Prg3".
The code below produces a data frame similar to the one I have. Prg is the program in the respective year, First is the first year the customer enters, Sec the second, and Third the third.
The code produces columns that extract the program consumed in the first contract (Code_1_Prg), the second contract (Code_2_Prg), and the third contract (Code_3_Prg).
Unfortunately, I have not been successful in combining these 3 columns into the required result. I tried to group by ID and save the first element of the sequence in a new column called "chain1". Here I get the error message
Error in df %>% group_by(ID) %>% df$chain1 = df[df$Code_1_Prg != "NA", :
  could not find function "%>%<-"
even though I am using the magrittr and dplyr packages.
Below, I share some code that produces the data frame and the starting point for extracting the character variable in Code_1_Prg by group.
I would be really grateful if you could help me with this. Thank you very much in advance!
df <- data.frame("ID"=c("A","A","A","A","B", "B", "B","B","B","C","C", "C", "C","C","C","C"),
"Year_Contract" =c("2010", "2015", "2017","2017","2010","2010", "2015","2015","2020","2015","2015","2017","2017","2017","2018","2018"),
"Prg"=c("AIB","AIB","LLA","LLA","BBU","BBU", "KLU","KLU","DDI","CKN","CKN","BBU","BBU","BBU","KLU","KLU"),
"First"=c("2010","2010","2010","2010","2010","2010", "2010","2010","2010","2015","2015","2015","2015","2015","2015","2015"),
"Sec"=c("2015","2015","2015","2015","2015","2015", "2015","2015","2015","2017","2017","2017","2017","2017","2017","2017"),
"Third"=c("2017","2017","2017","2017","2020","2020", "2020","2020","2020","2018","2018","2018","2018","2018","2018","2018")
)
df$Code_1_Prg <- ifelse(df$Year_Contract == df$First, df$Prg, NA)
df$Code_2_Prg <- ifelse(df$Year_Contract == df$Sec, df$Prg, NA)
df$Code_3_Prg <- ifelse(df$Year_Contract == df$Third, df$Prg, NA)
detach(package:plyr)
library(dplyr)
library(magrittr)
df %>%
group_by(ID) %>%
df$chain1 = df[df$Code_1_Prg!="NA", "Code_1_Prg"]
# This is the final column I am trying to create
df2 <- data.frame("ID"=c("A","B", "C"),
"Goal" =c("AIB-LLA", "BBU-KLU-DDI", "CKN-BBU-KLU")
)
df <- merge(df, df2, by="ID")
Are you looking for something like this?
library(dplyr)
df %>%
group_by(ID) %>%
arrange(Year_Contract, .by_group = TRUE) %>%
distinct() %>%
summarise(sequence = toString(Prg))
ID sequence
<chr> <chr>
1 A AIB, AIB, LLA
2 B BBU, KLU, DDI
3 C CKN, BBU, KLU
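If the target is exactly the dash-separated Goal column shown in df2 in the question, a small variation is to drop repeated programs per customer before collapsing with a dash. This is only a sketch and assumes a program never recurs later in a customer's history (which holds for the example data):
library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(Year_Contract, .by_group = TRUE) %>%
  distinct(ID, Prg) %>%                        # keep each program once per customer
  summarise(Goal = paste(Prg, collapse = "-"))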
Related
My question seems simple, but I just can't do it. I have a data frame with multiple columns whose names start with coa, and another column p with values like A, D, F, and so on, which change according to the id.
All I found is how to do this matching with a fixed value, let's say "A", as below:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), "A")))
However, in my case, I want to compare to the column p specifically, where p changes, something like this:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), p)))
In this case, I get the error:
x no applicable method for 'type' applied to an object of class "factor"
Any thoughts? Thanks!
If we need to create a column, use if_any
library(dplyr)
library(stringr)
df <- df %>%
mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, p)))
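For reference, a minimal sketch on made-up data (the columns id, p, coa1, and coa2 are hypothetical, since the question has no reproducible dataset); note that str_detect() needs character input, so factor columns would first have to be converted with as.character():
library(dplyr)
library(stringr)
df <- tibble(
  id   = 1:3,
  p    = c("A", "D", "F"),
  coa1 = c("A", "B", "C"),
  coa2 = c("X", "D", "Y")
)
df %>%
  mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, fixed(p))))
# rows 1 and 2 become TRUE (a match in coa1 and coa2, respectively), row 3 FALSE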
I think this is a good place to use dplyr::across. You can run vignette('colwise') for a more comprehensive guide, but the key point here is that we can mutate all columns starting with "coa" simultaneously using the function ==, and we can pass a second argument, p, to == via the ... argument provided by across.
library(dplyr)
df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)
df %>%
mutate(across(.cols = starts_with('coa'), .fns = `==`, p))
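As a small usage note on the same toy data, the comparison can also be written as a lambda, and the .names argument keeps the original columns and adds the logical results alongside them (a sketch):
df %>%
  mutate(across(.cols = starts_with('coa'),
                .fns = ~ .x == p,
                .names = "{.col}_match"))
# coa1_match is TRUE in every row (coa1 equals p), coa2_match is FALSE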
I am working with a dataset for which I want to calculate row sums over the columns that start with a certain string and end with another specified string, using dplyr (in my example: starts_with('c_') & ends_with('_f')).
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- data.frame(t(values))
colnames(df) <- names
> df
  c_first_f c_second_o t_third_f c_fourth_f
1         5          3         2          5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R :
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or with new versions of tidyverse, select can take matches
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
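As an aside that answers the original question directly: assuming dplyr 1.0.0 or later, the two helpers can be combined with & inside a single select(), which also keeps all original columns (a sketch):
library(dplyr)
df %>%
  mutate(row.sum = rowSums(select(., starts_with('c_') & ends_with('_f'))))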
My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
The issue is that mutate applies operators and functions in a vectorized way. Thus, which is applied to the whole logical vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop). Because subj.list (length 6) is recycled against the 72-element subject vector, that comparison happens to be TRUE in 12 positions, so which() returns 12 indices, and mutate cannot fit a length-12 result into 72 rows, hence the error.
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
library(tibble)   # for as_tibble() and rownames_to_column()
subj.df <- df$subject %>%
  unique() %>%
  as_tibble() %>%
  rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df, by = c("subject" = "value"))
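For completeness, a vectorised way to get the same subject.id without rowwise() is match(), which returns, for each subject, the position of its first match in subj.list; this is essentially what the which() call in the loop computes one row at a time (a sketch):
df %>% mutate(subject.id = match(as.character(subject), subj.list))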
I am trying to create a new data frame with 2 columns, var1 and var2, each of which is the row sum of specific columns in the data frame sampData.
library(dplyr)
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
I know that I can select columns using [] and c(), like this:
sampData[ ,c("A","B")]
but when I try to generate and use that format from my vectors like this:
d1_ <-paste(var1, collapse=",")
d2_ <-paste(var2, collapse=",")
sampData[ ,d1_]
I get this error:
Error in `[.data.frame`(sampData, , d1_) : undefined columns selected
Which I also get if I try to calculate the rowSums -- which is what I am interested in getting.
data.frame(var1 = rowSums(sampData[ , d1_])
, var2 = rowSums(sampData[ , d2_]))
I think I have managed to figure out what you are asking, but if I am wrong, let me know.
You are trying to select the columns of sampData that match the values in var1 and var2, and sum across the rows, limited to the columns that matched each.
It is always better to provide reproducible data; here is some for this case (using dplyr to build it):
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
Then, you don't need to concatenate the column indices at all -- just use the variable (or column names, in your case) directly. Here, I have made the IDs letters and will match the letters. However, if your IDs are numeric, it will match that index (e.g., 3 will return the third column).
data.frame(
var1sums = rowSums(sampData[, var1])
, var2sums = rowSums(sampData[, var2])
)
Of note, if you were thinking of cat for combining the names, cat returns NULL after printing to the screen. If you need to concatenate values, you will need to use paste (or similar), but even that will not help here: paste(var1, collapse = ",") is a single string like "A,B,C", not a vector of column names, so it cannot be used for subsetting.
This question got me thinking about flexibility of such solutions, so here is an attempt using dplyr and tidyr, which yields effectively the same result. The difference is that this may provide more flexibility for variable selection or even downstream processing.
library(tidyr)   # gather() comes from tidyr
sampData %>%
# add column for individual
mutate(ind = 1:nrow(.)) %>%
# convert data to long format
gather("Variable", "Value", -ind) %>%
# Set to group by the individual we added above
group_by(ind) %>%
# Calculate sums as desired
summarise(
var1sums = sum(Value[Variable %in% var1])
, var2sums = sum(Value[Variable %in% var2])
)
However, the real advantage would come if you had an arbitrary number (or just a large number generally) of sets of variables that you wanted to get the individual sums from. Instead of manually constructing every column you might be interested in, you can use standard evaluation (as opposed to non-standard) to automatically generate the columns based on a named list of vectors:
# varList is the named list of vectors mentioned above, e.g.:
varList <- list(var1sums = var1, var2sums = var2)
sampData %>%
mutate(ind = 1:nrow(.)) %>%
gather("Variable", "Value", -ind) %>%
group_by(ind) %>%
# Calculate one column for each vector in `varList`
summarise_(
.dots = lapply(varList, function(x){
paste0("sum(Value[Variable %in% c('"
, paste(x, collapse = "', '")
, "')])")
})
)
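For reference, summarise_() and the .dots interface are deprecated in current dplyr; a shorter sketch of the same idea with purrr (assuming the same named varList) computes the sums directly on the wide data:
library(purrr)
# one column of row sums per element of varList; the list names become the column names
map_dfc(varList, ~ rowSums(sampData[, .x]))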
I am trying to generate a unique ID column using the RecordLinkage package. I have successfully done so when working with smaller datasets (<= 1,000,000), but have not been able to reproduce this result for larger datasets (> 1,000,000) that use different (but similar) functions in the package. I am given multiple identifier variables for which I want to generate a unique ID despite the fact that there may be some errors (near matches) or duplicates in the records.
Given some data frame of identifiers:
data(RLdata500)
df_identifiers <- RLdata500
This is the code for the smaller datasets (which works):
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- compare.dedup(df_identifiers)
p=epiWeights(rpairs)
classify <- epiClassify(p,0.3)
matches <- getPairs(object = classify, show = "links", single.rows = TRUE)
# this code writes an "ID" column that is the same for similar identifiers
classify <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_identifiers$ID_prior <- df_identifiers$ID
# merge matching information with the original data
df_identifiers <- left_join(df_identifiers, matches %>% select(ID.1,ID.2), by=c("ID"="ID.2"))
# replace matches in ID with the thing they match with from ID.1
df_identifiers$ID <- ifelse(is.na(df_identifiers$ID.1), df_identifiers$ID, df_identifiers$ID.1)
This approach is discussed here. But this code does not seem to be extensible to larger datasets, which require different (but similar) functions in the package. For example, the big data equivalent of compare.dedup is RLBigDataDedup, whose RLBigData class supports similar functions such as epiWeights, epiClassify, getPairs, etc. Replacing compare.dedup with RLBigDataDedup does not work in this situation.
Consider the following attempt for large datasets:
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- RLBigDataDedup(df_identifiers)
p=epiWeights(rpairs)
( . . . )
Here, the remaining code is almost identical to the first version. Although epiWeights and epiClassify work on the RLBigData class as expected, getPairs does not: its method for this class does not accept the show = "links" argument. Because of this, all subsequent code fails.
Is there a different approach that needs to be taken to generate a column of unique IDs when working with larger datasets in the RLBigData class, or is this just a limitation?
First, import the following libraries:
library(RecordLinkage)
library(dplyr)
library(magrittr)
Consider these example datasets from the RecordLinkage package:
data(RLdata500)
data(RLdata10000)
Assume we care about these matching variables and threshold:
matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5
The record linkage for SMALL datasets is as follows:
RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
compare.dedup() %>%
epiWeights() %>%
epiClassify(threshold) %>%
getPairs(show = "links", single.rows = TRUE) -> matching_data
Here, the following SMALL data manipulation may be applied to append the appropriate IDs to the given dataset (same code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
select(matching_data, id1, id2) %>%
arrange(id1) %>% filter(!duplicated(id2)),
by = c("ID" = "id2")) %>%
mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
select(-id1)
RLdata$ID <- RLdata_ID$ID
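As a quick sanity check (a sketch; the exact counts depend on the chosen threshold), RLdata500 is documented to contain 50 duplicate records, so the number of repeated IDs should come out close to 50:
sum(duplicated(RLdata$ID))   # roughly 50 for RLdata500 at a sensible threshold
n_distinct(RLdata$ID)        # number of distinct entities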
The equivalent code for LARGE datasets is as follows:
RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
RLBigDataDedup() %>%
epiWeights() %>%
epiClassify(threshold) %>%
getPairs(filter.link = "link", single.rows = TRUE) -> matching_data
Here, the following LARGE data manipulation may be applied to append the appropriate IDs to the given dataset (similar to code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
select(matching_data, id.1, id.2) %>%
arrange(id.1) %>% filter(!duplicated(id.2)),
by = c("ID" = "id.2")) %>%
mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
select(-id.1)
RLdata$ID <- RLdata_ID$ID