R - Create concatenated column of row-wise unique values across columns

My data looks like this:
df <- data.frame(id=1:8,
f1 = c("A","B","B","C","C","C","A","A"),
f2 = c("A",NA,"B",NA,"B","A","B","A"),
f3 = c("A",NA,NA,NA,NA,"A","C","C"))
What I would like to create is a column that contains the unique values present in each row (NAs excluded). So the result would be the column "f_values":
id f1 f2 f3 f_values
1 1 A A A A
2 2 B <NA> <NA> B
3 3 B B <NA> B
4 4 C <NA> <NA> C
5 5 C B <NA> CB
6 6 C A A CA
7 7 A B C ABC
8 8 A A C AC
Row 1 is A because only A appears. Row 6 is CA because C and A are the distinct values that appear. I'd describe the function as "paste row-wise unique". I'm aware that it would be possible to chain together a number of comparison operators and paste statements, but the real data has many more columns, so I was hoping someone knew an easier way.

Given df above,
# one function per row, so the result does not depend on whether apply() returns a list or a matrix
f_values <- apply(df[,-1], 1, function(x) paste(unique(na.omit(x)), collapse = ""))
df_new<-cbind(df,f_values)
df_new will be the desired outcome as formulated in your question.
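Since the real data has many more columns, it may help to pick the value columns by name instead of dropping `id` by position. The following is a minimal base R sketch, assuming the value columns all follow the `f1`, `f2`, ... naming used in the question; the pattern `^f[0-9]+$` is an assumption and would need adjusting for other names.

```r
# Example data from the question; stringsAsFactors = FALSE keeps the columns as character
df <- data.frame(id = 1:8,
                 f1 = c("A","B","B","C","C","C","A","A"),
                 f2 = c("A",NA,"B",NA,"B","A","B","A"),
                 f3 = c("A",NA,NA,NA,NA,"A","C","C"),
                 stringsAsFactors = FALSE)

# Assumed naming scheme: every value column is "f" followed by digits
f_cols <- grep("^f[0-9]+$", names(df), value = TRUE)

# For each row: drop NAs, keep distinct values in order of appearance, paste them together
df$f_values <- apply(df[f_cols], 1,
                     function(x) paste(unique(na.omit(x)), collapse = ""))

print(df)
```

Selecting by name keeps the call unchanged when other non-value columns are added to the data frame.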

We can also do this in data.table by grouping with 'id'.
library(data.table)
setDT(df)[, f_values := paste(na.omit(unique(unlist(.SD))), collapse="") , id]
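In the call above, `.SD` holds every column except the grouping column `id`, so any extra columns in the real data would be pasted in as well. A hedged sketch that restricts the operation with `.SDcols`, assuming the value columns are exactly `f1`, `f2` and `f3`:

```r
library(data.table)

# Example data from the question, built directly as a data.table
DT <- data.table(id = 1:8,
                 f1 = c("A","B","B","C","C","C","A","A"),
                 f2 = c("A",NA,"B",NA,"B","A","B","A"),
                 f3 = c("A",NA,NA,NA,NA,"A","C","C"))

# Assumption: these are the only columns that should contribute to f_values
value_cols <- c("f1", "f2", "f3")

# One group per id; .SDcols limits .SD to the value columns
DT[, f_values := paste(na.omit(unique(unlist(.SD))), collapse = ""),
   by = id, .SDcols = value_cols]

print(DT)
```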

Related

Manipulating data.table column recursively on other column condition

I need to calculate a formula in a data frame. Each set of values across a few columns has to be, let's say for simplicity's sake, aggregated. However, I do not want a calculation across rows. I want to combine each set with another set that is selected by a condition elsewhere.
This is what I mean:
I have a data.table.
data = data.table(A = c("a","c","b","b","a"),
B = c(1:5),
C = c(1:5)
)
setorder(data, A)
> data
A B C
1: a 1 1
2: a 5 5
3: b 3 3
4: b 4 4
5: c 2 2
In column D I need an aggregate of the values in B and C of the current row and the values in B and C of a row where A is "a". As I have more than one "a", multiple aggregations are needed. The minimum of these aggregates should be written to D.
Here is an example.
For row 1: (1+1)+(1+1)=4, (5+5)+(1+1)=12, so 4 is minimum - D1 =4.
For row 3: (3+3)+(1+1)=8, (3+3)+(5+5)=16, D3 = 8. And so on.
This is what I expect
> data_new
A B C D
1: a 1 1 4
2: a 5 5 12
3: b 3 3 8
4: b 4 4 10
5: c 2 2 6
I tried this and ran into issues.
for (i in data)data[i, D:=(min((data[i,B+C]) + (data[a=="a",(B+C)])))]
The expression below for selecting the minimum works fine on its own when I substitute a row number for i: the inner sum returns two numbers and min() returns the proper value. For row 3, the answer is 8.
min((data[3,B+C]) + (data[A=="a",(B+C)]))
My previous attempts involved expand.grid() and intersect(). However, with the size of my data set I ran into memory issues and RStudio quit on me. As a side note, I need to run the calculations because I cannot predict beforehand which "a" row gives the smallest outcome: the values are a set of coordinates, and they do not correlate with the magnitude of the answer.
Any suggestions as to where my glaring issue is?
You can store the values of B + C where A == 'a' in a variable (val). For each row, you can then take the minimum of B + C + val.
library(data.table)
val <- data[A =='a', B + C]
data[, D := min(B + C + val), seq_len(nrow(data))]
data
# A B C D
#1: a 1 1 4
#2: a 5 5 12
#3: b 3 3 8
#4: b 4 4 10
#5: c 2 2 6
You can also use sapply, which returns a plain numeric vector (lapply would create a list column):
data[, D := sapply(B + C, function(x) min(x + val))]
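Another option is to take the min of 'B' and 'C' over the 'a' rows, replicate that single row to the length of the data, and add it directly to the 'B' and 'C' columns. The advantage is that we don't have to group or loop. Note that this takes the two minima separately, so it matches min(B + C + val) only when the same 'a' row has both the smallest B and the smallest C, as it does here.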
library(data.table)
Reduce(`+`, (data[A == 'a', .(B = min(B), C = min(C))][rep(seq_len(.N), nrow(data))] + data[, .(B, C)]))
#[1] 4 12 8 10 6
Or in a single line, adding the smallest B + C among the 'a' rows to every row
data[, D := B + C + min((B + C)[A == 'a'])]
data$D
#[1] 4 12 8 10 6
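One caveat for any variant that takes min(B) and min(C) separately: it agrees with the row-wise definition only when a single 'a' row holds both minima. The base R sketch below uses hypothetical data (not from the question) in which the smallest B and the smallest C sit in different 'a' rows, and compares three ways of computing D.

```r
# Hypothetical data: the two 'a' rows are (B = 1, C = 5) and (B = 5, C = 1)
A <- c("a", "a", "b")
B <- c(1, 5, 3)
C <- c(5, 1, 3)

# B + C for the 'a' rows, as in the answer above
val <- (B + C)[A == "a"]

# Row-wise definition from the question: min over all 'a' rows of (B + C) + val
row_wise <- sapply(B + C, function(x) min(x + val))

# Equivalent vectorised form: adding a constant does not change which 'a' row is smallest
shifted <- B + C + min(val)

# Separate minima: mixes B from one 'a' row with C from another
separate <- B + C + min(B[A == "a"]) + min(C[A == "a"])

print(row_wise)  # 12 12 12
print(shifted)   # 12 12 12
print(separate)  # 8 8 8
```

For a purely additive formula like the one in the example, the vectorised form avoids both grouping and looping while still matching the row-wise result.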

How to separate a data frame that has data of varying lengths?

I have to import a table that looks like the following data frame:
> df = data.frame(x = c("a", "a.b","a.b.c","a.b.d", "a.d"))
> df
x
1 a
2 a.b
3 a.b.c
4 a.b.d
5 a.d
I'd like to separate the first column into one or more columns based on how many separators are found.
The output should look like this:
> df_separated
col1 col2 col3
1 a <NA> <NA>
2 a b <NA>
3 a b c
4 a b d
5 a d <NA>
I tried to use the separate function in tidyr, but I need to specify a priori how many output columns I need.
Thank you very much for your help
You can first count the maximum number of columns needed and then use separate. The separator is a regular expression, so the literal dot has to be escaped as "\\.".
nmax <- max(stringr::str_count(df$x, "\\.")) + 1
tidyr::separate(df, x, paste0("col", seq_len(nmax)), sep = "\\.", fill = "right")
# col1 col2 col3
#1 a <NA> <NA>
#2 a b <NA>
#3 a b c
#4 a b d
#5 a d <NA>
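The same idea (find the largest number of pieces, then pad shorter rows with NA) can be sketched in base R without extra packages. The names `parts`, `nmax` and `df_separated` are illustrative; `fixed = TRUE` is used so the dot is treated literally and needs no escaping.

```r
df <- data.frame(x = c("a", "a.b", "a.b.c", "a.b.d", "a.d"),
                 stringsAsFactors = FALSE)

# Split each string on a literal dot
parts <- strsplit(df$x, ".", fixed = TRUE)

# Largest number of pieces in any row
nmax <- max(lengths(parts))

# Indexing past the end of a vector yields NA, which pads short rows on the right
mat <- do.call(rbind, lapply(parts, function(p) p[seq_len(nmax)]))

df_separated <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df_separated) <- paste0("col", seq_len(nmax))

print(df_separated)
```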

I need to split a vector into a different column for each factor in R

My data have factors of different lengths, like this.
variable <- c("A,B,C","A,B","A,C","B,C")
I have tried strsplit and other similar functions, but I can't solve my problem.
I need to get a data.frame like this:
A B C
1 A B C
2 A B NA
3 A NA C
4 NA B C
Thanks
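We could split the data on commas, turn each piece into a one-row data frame, and name its columns after the values. We can then bind the rows by column name using bind_rows from dplyr. lapply is used so that a list is always returned, even when every element splits into the same number of pieces.
dplyr::bind_rows(lapply(strsplit(variable, ","), function(x)
setNames(as.data.frame(t(x)), x)))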
# A B C
#1 A B C
#2 A B <NA>
#3 A <NA> C
#4 <NA> B C
We can use rbindlist
library(data.table)
rbindlist(lapply(strsplit(variable, ","),
function(x) setNames(as.list(x), x)), fill = TRUE)
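Both answers rely on matching values to columns by name. The same matching can be sketched in base R with `%in%`; the names `parts`, `lev` and `res` are illustrative and not from the answers above.

```r
variable <- c("A,B,C", "A,B", "A,C", "B,C")

# Split each element on commas
parts <- strsplit(variable, ",", fixed = TRUE)

# All distinct values, sorted; these become the column names
lev <- sort(unique(unlist(parts)))

# For each element, keep the value where it is present and NA where it is not
rows <- lapply(parts, function(p) ifelse(lev %in% p, lev, NA_character_))

res <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(res) <- lev

print(res)
```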

Check if names are found in different columns and which one

I have a data frame with 4 columns, each representing a different treatment. Each column is filled with protein numbers, and the columns have different numbers of rows. Is there a way to compare all 4 columns and get, as a result, a fifth column saying in which of the columns each value is found? I know some values will occur in two or maybe even three of the columns, and I was wondering if there's a way to get this as the end result in a new column.
I tried Data$A %in% Data$B, but this just gives me TRUE or FALSE between two columns. I was looking for some option like match or even contains, but all the options seem to give only a TRUE or FALSE answer.
What I need is something like this.
A B C
1 DSFG DSFG DSGG
2 DDEG DDED DDEE
3 HUGO HUGI HUGO
So if this is my table, I want the result like this
D(?) E
1 DSFG A,B
2 DSGG C
4 DDEG A
5 DDED B
6 DDEE C
7 HUGO A,C
8 HUGI B
Solution
An idea via base R is to use stack to convert to long, and aggregate to get the required output.
aggregate(ind ~ values, stack(df), toString)
# values ind
#1 DDED B
#2 DDEE C
#3 DDEG A
#4 DSFG A, B
#5 DSGG C
#6 HUGI B
#7 HUGO A, C
NOTE: Your columns need to be character for this to work. You can convert them with df[] <- lapply(df, as.character).
Explanations
Stacking turns data into "long format":
stack(df)
values ind
1 DSFG A
2 DDEG A
3 HUGO A
4 DSFG B
5 DDED B
6 HUGI B
7 DSGG C
8 DDEE C
9 HUGO C
toString() simply joins the elements of a vector with a comma and a space:
toString(c("A", "B", "C"))
[1] "A, B, C"
Aggregating returns a vector of "ind"s for each value, and these are then turned into a string using the function above:
aggregate(ind ~ values, stack(df), FUN=toString)
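The question mentions that the treatments have different numbers of rows. stack() also accepts a named list, so one possible sketch keeps each treatment as a list element of its own length. The list `treatments` and its contents are hypothetical, chosen only to illustrate unequal lengths.

```r
# Hypothetical treatments of different lengths, stored as a named list of character vectors
treatments <- list(A = c("DSFG", "DDEG", "HUGO"),
                   B = c("DSFG", "DDED"),
                   C = c("DSGG", "DDEE", "HUGO", "HUGI"))

# Long format: one row per (value, treatment) pair
long <- stack(treatments)

# For each value, paste the distinct treatments it occurs in
res <- aggregate(ind ~ values, long,
                 FUN = function(x) paste(unique(x), collapse = ","))

print(res)
```

unique() is applied inside the function so that a value repeated within one treatment is listed only once.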
Doing it the tidy way:
Input
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"), B = c("DSFG", "DDED", "HUGI"), C = c("DSGG", "DDEE", "HUGO"))
Summarizing data
library(tidyverse)
df %>%
gather("Column", "Value", 1:3) %>%
group_by(Value) %>%
summarise(Cols = paste(Column, collapse = ","))
Output
Value Cols
DDED B
DDEE C
DDEG A
DSFG A,B
DSGG C
HUGI B
HUGO A,C
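gather() still works but is marked as superseded in tidyr; the same reshaping can be sketched with pivot_longer(). This is one possible equivalent under the same input, not the answer's original code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(A = c("DSFG", "DDEG", "HUGO"),
                 B = c("DSFG", "DDED", "HUGI"),
                 C = c("DSGG", "DDEE", "HUGO"),
                 stringsAsFactors = FALSE)

res <- df %>%
  # Long format: one row per cell, with the source column name in "Column"
  pivot_longer(cols = everything(), names_to = "Column", values_to = "Value") %>%
  group_by(Value) %>%
  # Collapse the column names in which each value occurs
  summarise(Cols = paste(Column, collapse = ","))

print(res)
```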

How to select rows from a data frame with replacement

I have a data frame defined as follows:
t1 <- data.frame(x=c("A","B","C"),y=c(5,7,9))
> t1
x y
1 A 5
2 B 7
3 C 9
and a vector of picks:
picks <- c("B","C","B")
How do I get these rows, with replacement, in this order selected from the data frame?
I want:
x y
B 7
C 9
B 7
I tried
> t1[t1$x %in% picks,]
x y
2 B 7
3 C 9
and several other combinations of match, grep, which, etc., and cannot get what I want. It seems like it should be easy, but I'm not finding the path.
You can perform a right join using data.table:
library(data.table)
picks <- data.table(x = picks)
setDT(t1)[picks, on = "x"]
# x y
#1: B 7
#2: C 9
#3: B 7
The rows of the joined data.table follow the order of x in picks, so repeated picks give repeated rows.
If only the y values are needed, we can also index a named lookup vector:
setNames(t1$y, t1$x)[picks]
#B C B
#7 9 7
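For comparison, a minimal base R sketch using match(), which returns, for each element of picks, the position of its first match in t1$x. Indexing the rows with those positions keeps both the order and the repeats. This assumes the values in t1$x are unique.

```r
t1 <- data.frame(x = c("A", "B", "C"), y = c(5, 7, 9),
                 stringsAsFactors = FALSE)
picks <- c("B", "C", "B")

# Row positions in t1 for each pick, in pick order: 2 3 2
idx <- match(picks, t1$x)

# Row indexing allows repeated positions, so "B" appears twice
res <- t1[idx, ]

print(res)
```

By contrast, `t1$x %in% picks` yields one TRUE/FALSE per row of t1, which is why the attempt in the question loses both the order and the duplicate.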
