My data have different length factors like this.
variable <- c("A,B,C","A,B","A,C","B,C")
I have tried strsplit and other similar functions, but I can't solve my problem.
I need to get a data.frame like this
A B C
1 A B C
2 A B NA
3 A NA C
4 NA B C
Thanks
We could split each string on the comma, turn each result into a one-row data frame whose column names are the values themselves, and then bind the rows by column name using bind_rows from dplyr.
dplyr::bind_rows(lapply(strsplit(variable, ","), function(x)
setNames(as.data.frame(t(x)), x)))
# A B C
#1 A B C
#2 A B <NA>
#3 A <NA> C
#4 <NA> B C
We can use rbindlist
library(data.table)
rbindlist(lapply(strsplit(variable, ","),
function(x) setNames(as.list(x), x)), fill = TRUE)
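For comparison, the same idea can be sketched in base R without extra packages. This is only an illustrative sketch: the helper names parts, lev, m and res are made up here, and it assumes each value appears at most once per string.

```r
variable <- c("A,B,C", "A,B", "A,C", "B,C")

# Split each string into its comma-separated pieces
parts <- strsplit(variable, ",")

# All distinct values, sorted; these become the column names
lev <- sort(unique(unlist(parts)))

# For each string, put every value under its own column and NA elsewhere
m <- t(vapply(parts, function(x) x[match(lev, x)], character(length(lev))))

res <- setNames(as.data.frame(m, stringsAsFactors = FALSE), lev)
res
```

match(lev, x) gives the position of each column name inside the split string (or NA when it is absent), so indexing x by it lines the values up with their columns.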
My df has two columns a and b.
Rows that contain values for a are NAs in b and vice versa.
I wish to create a new column ab that will contain only values that are not NA
My data:
df = data.frame (a = c(rep("c",4), (rep(NA,4))), b = c(rep(NA,4),rep("e",4)))
I tried first using dplyr
df = df %>%
mutate (ab = ifelse (is.na (a), b, a))
and base
df$ab = ifelse (is.na(df$a), df$b, df$a)
The outcome is the same:
a b ab
1 c <NA> 1
2 c <NA> 1
3 c <NA> 1
4 c <NA> 1
5 <NA> e 1
6 <NA> e 1
7 <NA> e 1
8 <NA> e 1
My questions are:
1. Why does it return a value that is not in either the true or the false argument?
2. How can I create a column that combines a and b according to whichever is not NA? (preferably using dplyr)
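You have factor columns in the data. ifelse drops the factor attributes and returns the underlying integer codes, and since each column has a single level, every code is 1. Your problem would be solved if you set stringsAsFactors = FALSE while constructing the data frame (this is the default since R 4.0.0).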
df <- data.frame (a = c(rep("c",4), (rep(NA,4))),
b = c(rep(NA,4),rep("e",4)), stringsAsFactors = FALSE)
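To see where the 1s come from, here is a small sketch that builds the factors explicitly, so it behaves the same on any R version. The names codes and fixed are chosen only for illustration.

```r
# Factor columns, as data.frame() created them by default before R 4.0.0
a <- factor(c("c", "c", NA, NA))
b <- factor(c(NA, NA, "e", "e"))

# Each factor has a single level, so every non-NA entry has integer code 1
as.integer(a)

# ifelse() drops the factor attributes and returns the codes, not the labels
codes <- ifelse(is.na(a), b, a)
codes

# Converting to character first keeps the labels
fixed <- ifelse(is.na(a), as.character(b), as.character(a))
fixed
```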
However, dplyr has a nice coalesce function which does exactly what you need without using ifelse.
library(dplyr)
df %>% mutate(ab = coalesce(a, b))
An option with data.table
library(data.table)
setDT(df)[, ab := fcoalesce(a, b)][]
I have a data frame with 4 columns; each column represents a different treatment. Each column is filled with protein numbers, and the columns have different numbers of rows. Is there a way to compare all 4 columns and get, as a result, a fifth column saying in which of the columns each value is found? I know that some values occur in two or maybe even three of the columns, and I was wondering if there is a way to get this as the end result in a new column.
I tried Data$A %in% Data$B, but this just gives me TRUE or FALSE between two columns. I was looking for some option like match or even contain, but all the options seem to give only a TRUE or FALSE answer.
What I need is something like this.
A B C
1 DSFG DSFG DSGG
2 DDEG DDED DDEE
3 HUGO HUGI HUGO
So if this is my table, I want the result like this
D(?) E
1 DSFG A,B
2 DSGG C
4 DDEG A
5 DDED B
6 DDEE C
7 HUGO A,C
8 HUGI B
Solution
An idea via base R is to use stack to convert to long, and aggregate to get the required output.
aggregate(ind ~ values, stack(df), toString)
# values ind
#1 DDED B
#2 DDEE C
#3 DDEG A
#4 DSFG A, B
#5 DSGG C
#6 HUGI B
#7 HUGO A, C
NOTE: Your columns need to be character (not factor) for this to work. If they are factors, convert them first: df[] <- lapply(df, as.character)
Explanations
Stacking turns data into "long format":
stack(df)
values ind
1 DSFG A
2 DDEG A
3 HUGO A
4 DSFG B
5 DDED B
6 HUGI B
7 DSGG C
8 DDEE C
9 HUGO C
toString() simply joins the elements of a vector with a comma and a space:
toString(c("A", "B", "C"))
[1] "A, B, C"
Aggregating returns a vector of "ind"s for each value, and these are then turned into a string using the function above:
aggregate(ind ~ values, stack(df), FUN=toString)
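Put together, the steps above can be run end to end as follows. This is a sketch using the example table from the question; stringsAsFactors = FALSE is passed explicitly so the columns are character on older R versions too, and long and res are illustrative names.

```r
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"),
                 B = c("DSFG", "DDED", "HUGI"),
                 C = c("DSGG", "DDEE", "HUGO"),
                 stringsAsFactors = FALSE)

# Long format: one row per (value, column name) pair
long <- stack(df)

# For each distinct value, join the names of the columns it occurs in
res <- aggregate(ind ~ values, long, toString)
res
```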
Doing it the tidy way:
Input
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"), B = c("DSFG", "DDED", "HUGI"), C = c("DSGG", "DDEE", "HUGO"))
Summarizing data
library(tidyverse)
df %>%
gather("Column", "Value", 1:3) %>%
group_by(Value) %>%
summarise(Cols = paste(Column, collapse = ","))
Output
Value Cols
DDED B
DDEE C
DDEG A
DSFG A,B
DSGG C
HUGI B
HUGO A,C
I am in the process of reformatting a few data frames and was wondering if there is a more efficient way to add named columns to data frames, rather than the below:
colnames(df) <- c("c1", "c2")
to rename the current columns and:
df$c3 <- ""
to create a new column.
Is there a way to do this in a quicker manner? I'm trying to add dozens of named columns and this seems like an inefficient way of going through the process.
Use your method in a shorter way:
cols_2_add <- c("a", "b", "c", "f")
df[, cols_2_add] <- ""
Additional columns can also be added using merge. Merge the existing data frame with a one-row data frame that holds the desired columns; because the two share no column names, merge returns their cross join, so every existing row gains the new columns. This is helpful if you want to create columns of different types.
For example:
# Existing dataframe
df <- data.frame(x=1:3, y=4:6)
#use merge to create say desired columns as a, b, c, d and e
merge(df, data.frame(a="", b="", c="", d="", e=""))
# Result
# x y a b c d e
#1 1 4
#2 2 5
#3 3 6
# Desired columns of different types
library(dplyr)
bind_rows(df, data.frame(a=character(), b=numeric(), c=double(), d=integer(),
e=as.Date(character()), stringsAsFactors = FALSE))
# x y a b c d e
#1 1 4 <NA> NA NA NA <NA>
#2 2 5 <NA> NA NA NA <NA>
#3 3 6 <NA> NA NA NA <NA>
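If no extra package is wanted, typed empty columns can also be added in base R by assigning a list of typed NA values. This is a sketch only; the column names a, b and e and their types are picked for illustration.

```r
# Existing data frame
df <- data.frame(x = 1:3, y = 4:6)

# One typed NA per new column; each is recycled to the number of rows
df[c("a", "b", "e")] <- list(NA_character_, NA_real_, as.Date(NA))

# Check the resulting column types
str(df)
```

Using typed NA constants (NA_character_, NA_real_, and so on) fixes the column type up front, so later assignments into the column do not change it unexpectedly.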
A simple loop can help here
name_list <- c('a1','b1','c1','d1')
# example df
df <- data.frame(a = runif(3))
# this adds a new column
for(i in name_list)
{
df[[i]] <- runif(3)
}
# output
a a1 b1 c1 d1
1 0.09227574 0.08225444 0.4889347 0.2232167 0.8718206
2 0.94361151 0.58554887 0.7095412 0.2886408 0.9803941
3 0.22934864 0.73160433 0.6781607 0.7598064 0.4663031
# in case of data.table, a for loop with set() provides a faster version:
library(data.table)
# example df
df <- data.table(a = runif(3))
for(i in name_list)
set(df, j=i, value = runif(3))
I'm trying to assign NA to each row of my b column where the a column is NA.
The columns are in a data frame df.
But when I run the following code, my whole b column becomes NA.
What should I change?
for(i in 1:nrow(df))
{
row <- df[i,]
is.na(df$`a`) <- (df$b <- NA)
}
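In your loop, df$b <- NA overwrites the whole b column on every iteration, which is why all of b ends up NA. No loop is needed: we can use the vectorized option by creating a logical vector (is.na(df$a)), using it to subset the elements of 'b', and assigning NA to them.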
df$b[is.na(df$a)] <- NA
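As a small worked example of that one-liner, here is a sketch on made-up data (the values in a and b and the name na_in_a are only for illustration):

```r
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c("w", "x", "y", "z"),
                 stringsAsFactors = FALSE)

# TRUE wherever a is missing
na_in_a <- is.na(df$a)

# Only the matching rows of b are set to NA; the others are untouched
df$b[na_in_a] <- NA
df
```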
If we are using data.table, this can be assigned (:=) in place
library(data.table)
setDT(df)[is.na(a), b := NA]
According to the documentation ?is.na, is.na<- is a generic function which sets elements to NA. For the right hand side of is.na(x) <- value, value needs to be a suitable index vector for use with x.
Examples:
df <- data.frame(a = LETTERS[1:5], b = 1:5)
is.na(df$b) <- c(2, 4)
df
# a b
#1 A 1
#2 B NA
#3 C 3
#4 D NA
#5 E 5
So, the 2nd and 4th elements of vector df$b have been set to NA.
Now, if the corresponding elements of df$a should be set to NA as well, we can use:
is.na(df$a) <- is.na(df$b)
df
# a b
#1 A 1
#2 <NA> NA
#3 C 3
#4 <NA> NA
#5 E 5
NB: I've learned about this feature from answers and comments on Why does is.na() change its argument?
My data looks like this:
df <- data.frame(id=1:8,
f1 = c("A","B","B","C","C","C","A","A"),
f2 = c("A",NA,"B",NA,"B","A","B","A"),
f3 = c("A",NA,NA,NA,NA,"A","C","C"))
What I would like to create is a column that contains the unique values present in each row (NAs excluded). So the result would be the column "f_values":
id f1 f2 f3 f_values
1 1 A A A A
2 2 B <NA> <NA> B
3 3 B B <NA> B
4 4 C <NA> <NA> C
5 5 C B <NA> CB
6 6 C A A CA
7 7 A B C ABC
8 8 A A C AC
row1 is A b/c only A appears. row6 is CA because C and A appear uniquely. I'd describe the function as paste row-wise unique. I'm aware that it would be possible to chain together a number of comparison operators and paste statements, but the real data has many more columns, so I was hoping someone knew an easier way.
Given df above,
f_values <- apply(df[, -1], 1, function(x) paste(unique(na.omit(x)), collapse = ""))
df_new<-cbind(df,f_values)
df_new will be the desired outcome as formulated in your question.
We can also do this in data.table by grouping with 'id'.
library(data.table)
setDT(df)[, f_values := paste(na.omit(unique(unlist(.SD))), collapse="") , id]