Create a column with unique values from list columns - R

I have a dataset in RStudio made of columns that contain lists. Here is an example where column "a" and column "c" contain a list in each row.
What am I looking for?
I need to create a new column that collects the unique values from columns "a", "b", and "c", skipping NA or NULL values.
The expected result is shown in column "desired_result".
library(tibble)

test <- tibble(
  a = list(c("x1", "x2"), c("x1", "x3"), "x3"),
  b = c("x1", NA, NA),
  c = list(c("x1", "x4"), "x4", "x2"),
  desired_result = list(c("x1", "x2", "x4"), c("x1", "x3", "x4"), c("x2", "x3"))
)
What have I tried so far?
I tried the following, but it does not produce the expected result shown in column "desired_result":
test$attempt_1_ <- lapply(apply(test[, c("a", "b", "c"), drop = TRUE],
                                MARGIN = 1, FUN = c, use.names = FALSE), unique)

We may use pmap to loop over the corresponding elements of 'a' to 'c', remove the NAs (na.omit), and get the unique values to store as a list in 'desired_result':
library(dplyr)
library(purrr)
test <- test %>%
  mutate(desired_result2 = pmap(across(a:c), ~ sort(unique(na.omit(c(...))))))
Checking against the OP's expected output:
> all.equal(test$desired_result, test$desired_result2)
[1] TRUE
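A base R alternative is also possible. The following is a minimal sketch (assuming the same test tibble as above; desired_result3 is a hypothetical new column name), where Map() iterates over the three columns in parallel much like pmap():

# Sketch: Map() walks over a, b and c element-wise; as.vector() drops the
# "na.action" attribute that na.omit() leaves behind
test$desired_result3 <- Map(
  function(...) sort(unique(as.vector(na.omit(c(...))))),
  test$a, test$b, test$c
)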

Related

How to subset columns based on column names that match those of another data frame?

I want to retain the columns of met.kirp.se only if the column names match those of the exp.kirp.log2 data frame. exp.kirp.log2 has more columns than met.kirp.se, and the code should make the number of columns in both data frames match.
met.kirp.se <- met.kirp.se[, colnames(met.kirp.se) %in% colnames(exp.kirp.log2)]
The number of columns in met.kirp.se is still different from exp.kirp.log2.
> ncol(met.kirp.se)
[1] 274
> ncol(exp.kirp.log2)
[1] 290
I'm sure there's a more concise way, but:
small_df <- data.frame(one = LETTERS[1:2], two = LETTERS[3:4])
large_df <- data.frame(one = LETTERS[5:6], three = LETTERS[7:8], four = 4)
small_df[colnames(small_df)[colnames(small_df) %in% colnames(large_df)]]
# one
#1 A
#2 B
small_df["one"] will output small_df with just the "one" column.
colnames(small_df) %in% colnames(large_df) will output TRUE FALSE since "one" is in large_df but "two" is not.
colnames(small_df)[colnames(small_df) %in% colnames(large_df)] will output "one", which we can use to subset small_df.
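A slightly more concise equivalent, sketched with base R's intersect() on the example data frames above (and, commented out, the same idea applied to the OP's objects, untested):

# Keep only the columns of small_df whose names also appear in large_df
small_df[intersect(colnames(small_df), colnames(large_df))]

# met.kirp.se <- met.kirp.se[, intersect(colnames(met.kirp.se), colnames(exp.kirp.log2)), drop = FALSE]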

Creating several new columns in a data frame using the same function

I'm sorry for the basic question. I'm just struggling with something that should be simple. Say I have the data frame "Test" that originally has three fields: Col1, Col2, Col3.
I want to create new columns based on each of the original columns. The values in each row of the new columns would specify whether the corresponding value in the matching row on the original column is above or below the initial column's median. So, for example, in the image attached, Col4 is based on Col1. Col5 is based on Col2. Col6 based on Col3.
[image: test data frame example]
It's quite easy to perform this function on a single column and output a single column:
library(dplyr)
library(mosaic)  # derivedFactor() comes from the mosaic package

Test <- Test %>% mutate(Col4 = derivedFactor(
  "below" = Col1 < median(Test$Col1),
  "at"    = Col1 == median(Test$Col1),
  "above" = Col1 > median(Test$Col1),
  .default = NA)
)
But if I'm performing this same operation over 50 columns, writing out/copy-paste and editing the code can be tedious and inefficient. I should mention that I am hoping to add the new columns to the data frame, not create another data frame. Additionally, there are about 200 other fields in the data frame that will not have this function performed on them (so I can't just use a mutate_all). And the columns are not uniformly named (my examples above are just examples, not the actual dataset) so I'm not able to find a pattern for mutate_at. Maybe there is a way to manually pass a list of column names to the mutate command?
There must be an easy and elegant way to do this. If anyone could help, that would be amazing.
You can do the following using data.table.
First, I define a function that takes a numeric vector and returns each element's position relative to the vector's median:
med_fn = function(x){
  med = median(x)
  unlist(sapply(x, function(x){
    if (x > med) {'Above'}
    else if (x < med) {'Below'}
    else {'At'}
  }))
}
> med_fn(c(1,2,3))
[1] "Below" "At" "Above"
Let us examine some sample data:
library(data.table)

dt = data.table(
  C1 = c(1, 2, 3),
  C2 = c(2, 1, 3),
  C3 = c(3, 2, 1)
)
old = c('C1', 'C2', 'C3') # Name of columns I want to perform operation on
new = paste0(old, '_medfn') # Name of new columns following operation
Using the .SD and .SDcols arguments from data.table, I apply med_fn across the columns old, in my case columns C1, C2 and C3. I call the new columns C#_medfn:
dt[, (new) := lapply(.SD, med_fn), .SDcols = old]
Result:
> dt
   C1 C2 C3 C1_medfn C2_medfn C3_medfn
1:  1  2  3    Below       At    Above
2:  2  1  2       At    Below       At
3:  3  3  1    Above    Above    Below
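If you would rather stay in the tidyverse, a sketch of the same idea (assuming dplyr >= 1.0, where across() is available, and reusing old and med_fn from above; result is a hypothetical name):

library(dplyr)

# Apply med_fn to the selected columns and write the output to new
# "<col>_medfn" columns, leaving all other columns untouched
result <- as.data.frame(dt) %>%
  mutate(across(all_of(old), med_fn, .names = "{.col}_medfn"))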

How to get all columns with the same column name in R at once?

Let's say I have the following data frame:
> test <- cbind(test=c(1, 2, 3), test=c(1, 2, 3))
> test
     test test
[1,]    1    1
[2,]    2    2
[3,]    3    3
Now from such data frame I want to fetch all the columns named "test" to a new data frame:
> new_df <- test[, "test"]
However, this last attempt only fetches the first column called "test" in the test data frame:
> new_df
[1] 1 2 3
How can I get all of the columns called "test" in this example and put them into a new data frame in a single command? In my real data I have many columns with repeated colnames, and I don't know the indices of the columns, so I can't get them by number.
It is not advisable to have duplicate column names, for practical reasons. But we can do a comparison (==) to get a logical vector and use that to extract the columns:
i1 <- colnames(test) == "test"
new_df <- test[, i1, drop = FALSE]
Note that data.frame doesn't allow duplicate column names and would make them unique by appending .1, .2, etc. with make.unique. A matrix (the OP's data structure) does allow duplicate column or row names (not recommended, though).
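A quick illustration of that renaming, on a small made-up example (not the OP's data):

data.frame(test = 1:2, test = 3:4)   # columns become "test" and "test.1"
make.unique(c("test", "test"))       # "test"   "test.1"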
Also, if there are multiple column names that are repeated and you want to select them as separate datasets, use split:
lst1 <- lapply(split(seq_len(ncol(test)), colnames(test)), function(i)
  test[, i, drop = FALSE])
Or loop through the unique column names with lapply and do a == comparison for each:
lst2 <- lapply(unique(colnames(test)), function(nm)
  test[, colnames(test) == nm, drop = FALSE])

How to extract rows of a data frame between two characters

I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values: the first column contains string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns contain either a header, like a percentage-change string (e.g. +10%), or just a numerical value. Since I have all this data lumped together, I want to be able to extract data for each category. So, for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in the first column. Is there a way to do this easily in either base R or dplyr?
e.g.
df %>%
  filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1, 2, 3, 4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1, 2, 3, 4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end){
  return(df[start:(end - 1), ])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
and save the function outputs however one would like:
lapply(1:(length(dividers) - 1), function(x)
  in_between(df, start = dividers[x], end = dividers[x + 1]))
This works. The data is messy, so I still have the annoying case where I need to keep the offset rows.
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")
library(dplyr)

df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))
split(df2, df2$group)
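Applied to the OP's keyword layout, a sketch of the same cumsum() idea (assuming the first column is named column1 and that keywords holds the section markers, as in the earlier answer):

keywords <- c("credit", "FX")

# Each keyword row starts a new group; split() then returns one data frame per section
sections <- split(df, cumsum(df$column1 %in% keywords))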

Store output of sapply into a data frame?

How can I store the output of sapply() in a data frame where the index (name) is stored in the first column and its value in the corresponding second column? For illustration, I have shown only 2 elements here, but there are 110 columns in my data. "loan" is the data frame.
cols <- sapply(loan,function(x) sum(is.na(x)))
cols
       id member_id 
        0         7 
I want output as:
var value
id 0
member_id 7
I know that sapply() returns a vector, but when I print the vector, the values are printed along with some "index", e.g. the column names if applied to a data frame. So, when I want to store it as a data frame with two columns, where the first column contains the index part and the second column contains the value, how can I do it?
I found an answer to my question. For those who actually did understand my problem, this answer might make sense:
cols <- data.frame(sapply(loan, function(x) sum(is.na(x))))
cols <- cbind(variable = row.names(cols), cols)
I wanted the row.names to be in a column of the same data frame corresponding to the values obtained from sapply.
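For reference, the same reshaping can be done in one step; this is a sketch assuming the tibble package is available (na_counts is a hypothetical name):

library(tibble)

# enframe() turns a named vector into a two-column tibble
na_counts <- sapply(loan, function(x) sum(is.na(x)))
enframe(na_counts, name = "var", value = "value")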
We can use stack:
stack(mylist)[2:1]
Data:
mylist <- list(df = 1, rf = 2)
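With that example list, stack() returns a two-column data frame (ind, then values after the [2:1] reordering); roughly:

> stack(mylist)[2:1]
  ind values
1  df      1
2  rf      2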
Is this what you want?
Your original list:
L <- c("df",1,"rf",2)
L
[1] "df" "1" "rf" "2"
As a data frame:
N <- length(L)
df <- data.frame( var = L[seq(1,N,2)], value = L[seq(2,N,2)] )
df
  var value
1  df     1
2  rf     2
