Remove duplicate entries in cell - R

I searched high and low on here, and tried the duplicated and unique functions for what I'm about to ask, but couldn't get anything to work. Let's say I have a data frame named company with a variable state. When I collapse the rows, I'm left with this output in one of the state variable observations:
PA;PA;PA;TX;TX
How could I remove the dups inside the cell (and entire vector for that matter), so it looks as follows:
PA;TX
I have no problems removing dup rows, but can't seem to do it for the cells themselves.

This works for a single string:
x <- "PA;PA;PA;TX;TX"
x2 <- strsplit(x, ";")
x3 <- unlist(x2)
x4 <- unique(x3)
x5 <- paste(x4, collapse = ";")
If you want to do it for the whole vector company$state, you could roll all that up into one call to sapply:
sapply(company$state, function(x) paste(unique(unlist(strsplit(x, ";"))), collapse = ";"))
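A slightly more defensive variant of the same idea might look like the sketch below. The helper name dedupe_cell and the toy company data frame are made up for illustration; the sketch assumes the separator is a literal string and that missing cells should stay missing.

```r
# Hypothetical helper: remove repeated tokens inside each delimited string.
dedupe_cell <- function(x, sep = ";") {
  x <- as.character(x)                      # strsplit() does not accept factors
  pieces <- strsplit(x, sep, fixed = TRUE)  # one character vector per cell
  out <- vapply(pieces,
                function(p) paste(unique(p), collapse = sep),
                character(1))
  out[is.na(x)] <- NA_character_            # keep missing cells missing
  out
}

# Toy data standing in for the real company data frame
company <- data.frame(state = c("PA;PA;PA;TX;TX", "NY", "CA;TX;CA"),
                      stringsAsFactors = FALSE)
company$state <- dedupe_cell(company$state)
company$state
```

Unlike sapply, vapply returns an unnamed character vector here, so the result can be assigned straight back to the column.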

Related

Modifying an object referenced by "get()" in R

Apologies if this has been asked before. It's at the limit of my understanding of R, so I'm not even sure of the correct language in which to couch the query (hence, my inability to identify duplicate questions).
In my environment, I have an unknown number of objects (dataframes), each of which has an unknown number of columns that have meaningful names but with nonsense endings, which make it hard to reference them. The meaningful parts of the column names are usually followed by a double period and some further text. I want to automate finding and removing the meaningless suffixes. All the objects I want to modify have ".dat" in their names. Here's my attempt at an example:
# create some objects in my environment
a <- "a string, not of interest to me"
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)
# find the dataframes that I want to manipulate
dfs <- ls(pattern = "\\.dat") # escape the dot: in a regular expression a bare "." matches any character
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# and, the bit that doesn't work: change the problematic column names to their shorter alternatives
names(get(df))[problem.cols] <<- parts
return(0)
})
If I run this line by line, it does everything I want, up to and including names(get(df))[problem.cols], which it knows are the names of the columns in the dataframe I'm trying to alter. However, it won't assign the altered names to that, yielding the error message: Error in get(*tmp*) : invalid first argument.
I'm open to alternative approaches to achieve my desired end-point. However, I'm also intrigued by why this doesn't work and how, more generally, it's possible to alter an object referenced using "get()". Thanks in advance for any advice - and apologies if this is so naive it's been a waste of your time just reading it.
FWIW, I can see the similarity to this question but I can't adapt the answer to my needs.
Actually, I eventually made the link to using the "assign" function. This seems to work (so I've posted it here, in case it helps anyone else) - but I'd still be interested in alternative solutions:
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# change the problematic column names to their shorter alternatives
nms[problem.cols] <- parts
names(dat) <- nms
assign(df, dat, envir = .GlobalEnv)
return(0)
})
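As for why the first attempt fails: R rewrites names(x) <- v as x <- `names<-`(x, v), so the innermost target of a replacement call has to be a variable name rather than a call such as get(df). One alternative that avoids get() and assign() altogether is to work on a named list of the data frames and write them back in one step. This is only a sketch under the question's assumptions (all objects of interest end in ".dat" and live in the current environment); the names env, df_names and dat_list are illustrative.

```r
# Example objects, as in the question
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)

env <- environment()                              # where the data frames live
df_names <- ls(envir = env, pattern = "\\.dat$")  # object names ending in ".dat"
dat_list <- mget(df_names, envir = env)           # named list of data frames

# Drop everything from the first ".." onwards in every column name
dat_list <- lapply(dat_list, function(d) {
  names(d) <- sub("\\.\\..*$", "", names(d))
  d
})

list2env(dat_list, envir = env)                   # write the cleaned copies back
names(b.dat)
names(c.dat)
```

Keeping the data frames in dat_list rather than copying them back would also work, and makes later loops over them simpler.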

Unique list of variable strings for model estimation

I want to create a vector of unique variable combinations to estimate various regression models for different sets of variables, while fixing one variable to be always included.
For example, I always want to include variable X1, plus a distinct combination of up to, say, three (this threshold could be varying depending on the specific data and research question at hand) other variables from the full list of available variables X2, X3, ..., XN.
The bi-variate case is rather simple, I guess.
However, already for tri-variate models, the variable combination "X1 X2 X3" will yield the same coefficients as "X1 X3 X2". Further, I also want to exclude combinations that contain the same variable twice, e.g. "X1 X2 X2".
How to exclude these "double-counting"/redundant combinations best? Or how to create such a vector of all possible distinct combinations?
Test code I have tried so far (separating variables with an underscore):
library(dplyr)
'%!in%' <- function(x,y)!('%in%'(x,y))
A <- c("X1", "X2", "X3", "X4", "X5") # all variables in dataset
a <- "X1" # keep X1 in all models
A_minus_a <- A[A %!in% a]
# first combination:
C1 <- outer(a, A_minus_a, paste, sep = "_")
# second set of combinations:
C2 <- outer(C1, A_minus_a, paste, sep = "_") %>% as.vector
# third set of combinations:
C3 <- outer(C2, A_minus_a, paste, sep = "_") %>% as.vector
# full list of model combinations, but including many "double-counted"/redundant models:
C <- c(C1, C2, C3)
Any help you can provide is very much appreciated!
P.S. for the second step I could prevent the problem by formatting the result of outer() into a matrix and then extracting the lower triangular elements without the diagonal of the matrix. However, when turning to the third set of combinations this does not work anymore. So, there might be a better solution from start.
How about using combn()? e.g. for sets of three variables:
cc <- combn(A_minus_a, m=3)
apply(cc,2,paste,collapse="_")
## [1] "X2_X3_X4" "X2_X3_X5" "X2_X4_X5" "X3_X4_X5"
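To get the full list the question asks for (X1 always included, plus one to three other variables), the same combn() call can be run once per subset size. The following is a sketch; others and max_extra are illustrative names, and the cap of three comes from the question's example.

```r
A <- c("X1", "X2", "X3", "X4", "X5")  # all variables in the dataset
a <- "X1"                              # variable kept in every model
others <- setdiff(A, a)
max_extra <- 3                         # assumed cap on additional variables

# For each size m = 1..max_extra, list every unordered subset of `others`
# and prepend the fixed variable.
sizes <- seq_len(min(max_extra, length(others)))
models <- unlist(lapply(sizes, function(m) {
  combn(others, m, FUN = function(v) paste(c(a, v), collapse = "_"))
}))
models
length(models)
```

Because combn() enumerates unordered subsets without repetition, neither reordered duplicates such as "X1_X3_X2" nor repeats such as "X1_X2_X2" can occur.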

how to cbind many data-frames?

I have 247 data frames which are sequentially named (y1, y2, y3, ..., y247). They are created by the following code:
for (i in (1:247)) {
nam <- paste("y", i, sep = "")
assign(nam, dairy[dairy$FARM==i,"YIT"])
}
I wish to cbind all of them to have:
df <- cbind(y1,y2,...,y247)
Can I do this with a loop without typing all 247 data frames?
Thanks
If you really want to do this, it is possible:
df <- y1
for (i in 2:247) {
df <- cbind(df, eval(parse(text=paste("y", i, sep = ''))))
}
Creating many variables in a loop as you do is not a good idea. You should use a list instead:
ys <- split(dairy$YIT, dairy$FARM)
names(ys) <- paste0("y", names(ys))
The first line creates a list ys that contains your y1 as its first element (ys[[1]]), your y2 as its second element (ys[[2]]) and so on. The second line names the list elements the same way as you named your variables (y1, y2, etc.), since those names will in the end be used to name the columns in the data frame.
There is a function in the dplyr package that takes a list of vectors or data frames and binds them all together as columns:
library(dplyr)
df <- bind_cols(ys)
Note, by the way, that this will only work if each value appears exactly the same number of times in the column FARM, since the columns in a data frame must all have the same length.
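If the y1, y2, ... objects already exist and you only want to collect them, base R's mget() can fetch them by name without eval(parse()). A small sketch with three toy vectors standing in for the 247 real ones:

```r
# Toy stand-ins for the y1, y2, ... objects from the question
y1 <- c(10, 11)
y2 <- c(20, 21)
y3 <- c(30, 31)

n <- 3                                                    # 247 in the question
ys <- mget(paste0("y", seq_len(n)), envir = environment()) # named list: y1, y2, y3
df <- as.data.frame(do.call(cbind, ys))                   # columns keep the names y1..y3
df
```

As with the other approaches, this assumes all the objects have the same length.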

Lookup Comma Separated Values in R

I am new to this community and currently working on an R project in which I need to look up each comma-separated element of one data frame in any of the columns of another data frame. Here is an example:
#DataFrame1
a=c("AA,BB","BB,CC,FF","CC,DD,GG,FF","GG","")
df1=as.data.frame(a)
#DataFrame2
x=c("AA","XX","BB","YY","ZZ","MM","YY","CC")
y=c("DD","VV","NN","XX","CC","AA","WW","FF")
z=c("CC","AA","YY","GG","HH","OO","PP","QQ")
df2=data.frame(x,y,z)
What I need to do is check whether any of the elements in a cell of df1 appears in any of the columns (x, y, z) of df2. Take "AA,BB" (the first cell in column a of df1): "AA" is one element and "BB" is another. If a match is found, I need to identify the matching row or rows of df2; more than one row may match. I hope I have explained the problem well. Experts, please help.
Here is a solution in two steps:
# load tidyverse
library(tidyverse)
Step 1: Split the elements separated by comma from df1 in a new data frame new_df
1a) To do this, we first identify the number of columns to be generated
(as the maximum number of elements separated by ,; that is: maximum number of , + 1)
number_new_columns <- max(sapply(df1$a, function(x) str_count(x, ","))) + 1
1b) Generate the new data frame new_df
new_df <- df1 %>%
separate(a, c(as.character(seq_len(number_new_columns)))) # missing pieces will be filled with NA
# Above, we used c(as.character(seq_len(number_new_columns))) to generate column names as numbers -- not very creative :)
Step 2: Identify the position of each unique element from new_df in df2
(hope I understood correctly this second part of the question)
2a) Get the unique elements (from new_df)
unique_elements <- unlist(new_df) %>%
unique()
2b) Get a list whose components contain the positions of each unique element within df2
output <- lapply(unique_elements, function(x) {
which(df2 == x, arr.ind=TRUE)
})
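If what is needed is the matching rows of df2 for each cell of df1 (rather than the positions of each unique element), a base R sketch along the following lines may be closer to the question. It assumes exact, case-sensitive matching and no spaces around the commas; tokens, m and matching_rows are illustrative names.

```r
df1 <- data.frame(a = c("AA,BB", "BB,CC,FF", "CC,DD,GG,FF", "GG", ""),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c("AA", "XX", "BB", "YY", "ZZ", "MM", "YY", "CC"),
                  y = c("DD", "VV", "NN", "XX", "CC", "AA", "WW", "FF"),
                  z = c("CC", "AA", "YY", "GG", "HH", "OO", "PP", "QQ"),
                  stringsAsFactors = FALSE)

tokens <- strsplit(df1$a, ",", fixed = TRUE)  # elements of each df1 cell
m <- as.matrix(df2)                           # character matrix of df2

# For every df1 cell, the row numbers of df2 containing at least one element
matching_rows <- lapply(tokens, function(tok) {
  hit <- matrix(m %in% tok, nrow = nrow(m))   # TRUE where a cell equals any token
  which(rowSums(hit) > 0)
})
names(matching_rows) <- df1$a
matching_rows
```

The empty cell in df1 yields an empty result, and df2[matching_rows[[1]], ] would return the matching rows themselves for the first cell.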

automatic column prefix with cbind and just one column

I have some trouble with a script which uses cbind to add columns to a data frame. I select these columns by regular expression, and I love that cbind automatically provides a prefix if you add more than one column. But this does not work if you append just one column, even if I cast this column as a data frame.
Is there a way to get around this behaviour?
In my example, it works fine for columns starting with a but not for b1 column.
df <- data.frame(a1=c(1,2,3),a2=c(3,4,5),b1=c(6,7,8))
cbind(df, log=log(df[grep('^a', names(df))]))
cbind(df, log=log(df[grep('^b', names(df))]))
cbind(df, log=as.data.frame(log(df[grep('^b', names(df))])))
A solution would be to create an intermediate dataframe with the log values and rename the columns :
logb = log(df[grep('^b', names(df))])
colnames(logb) = paste0('log.',names(logb))
cbind(df, logb)
What about
cbw <- c("a","b") # columns beginning with
cbw_pattern <- paste0("^",cbw, collapse = "|")
cbind(df, log=log(df[grep(cbw_pattern, names(df))]))
This way you select both patterns at once (all three columns).
The column names only come out without the prefix when just one column is selected.
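To get the same prefix regardless of how many columns match, the naming can be done explicitly instead of relying on cbind. A minimal sketch, with add_prefixed being a made-up helper name:

```r
df <- data.frame(a1 = c(1, 2, 3), a2 = c(3, 4, 5), b1 = c(6, 7, 8))

# Hypothetical helper: apply `fun` to the columns matching `pattern` and
# append the results with an explicit prefix, however many columns match.
add_prefixed <- function(data, pattern, fun, prefix) {
  cols <- grep(pattern, names(data), value = TRUE)  # matching column names
  new <- fun(data[cols])                            # still a data frame
  names(new) <- paste(prefix, cols, sep = ".")
  cbind(data, new)
}

names(add_prefixed(df, "^a", log, "log"))
names(add_prefixed(df, "^b", log, "log"))
```

Indexing with data[cols] keeps the selection a data frame even for a single column, so the one-column and many-column cases go through the same code path.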
