Splitting a dataframe by column name indices - r

This is a variation of an earlier question.
df <- data.frame(matrix(rnorm(9*9), ncol=9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
I want to split the dataframe by the index that is given in the column names after the underscore "_". (The indices can be any characters/numbers of different lengths; these are just random examples.)
indx <- gsub(".*_", "", names(df))
and name the resulting dataframes accordingly. In the end I would like to get three dataframes, called:
df_1
df_p
df_o1
Thank you!

Here, you can split the column names by indx, get the subset of data within the list using lapply and [, set the names of the list elements using setNames, and use list2env if you need them as individual datasets. (This is not really recommended, as most operations can be done within the list; if needed, the list elements can later be saved using write.table with lapply.)
list2env(
setNames(
lapply(split(colnames(df), indx), function(x) df[x]),
paste('df', sort(unique(indx)), sep="_")),
envir=.GlobalEnv)
head(df_1,2)
# c_1 d_1 e_1
#1 1.0085829 -0.7219199 0.3502958
#2 -0.9069805 -0.7043354 -1.1974415
head(df_o1,2)
# 1_o1 2_o1 3_o1
#1 0.7924930 0.434396 1.7388130
#2 0.9202404 -2.079311 -0.6567794
head(df_p,2)
# a_p b_p c_p
#1 -0.12392272 -1.183582 0.8176486
#2 0.06330595 -0.659597 -0.6350215
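If the pieces do not need to live as separate objects, the same split can be kept as a named list, as the note about list2env suggests. The following is a minimal sketch under that assumption; the name df_list, the fixed seed, and the use of tempdir() for the output files are illustrative choices, not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
df <- data.frame(matrix(rnorm(9 * 9), ncol = 9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
indx <- gsub(".*_", "", names(df))

# Split the column names by suffix and subset df once per group
df_list <- lapply(split(names(df), indx), function(cols) df[cols])
names(df_list) <- paste("df", names(df_list), sep = "_")

names(df_list)
sapply(df_list, ncol)

# Hypothetical follow-up: write each piece to its own file in a temp directory
out_dir <- tempdir()
invisible(lapply(names(df_list), function(nm) {
  write.table(df_list[[nm]], file.path(out_dir, paste0(nm, ".txt")),
              row.names = FALSE)
}))
```

Individual pieces remain accessible as df_list$df_1, df_list$df_p and df_list$df_o1, without adding objects to the global environment.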
Or using Map. This is similar to the above approach, i.e. split the column names by indx and use [ to extract the columns; the rest is as above.
list2env(setNames(Map(`[` ,
list(df), split(colnames(df), indx)),
paste('df',unique(sort(indx)), sep="_")), envir=.GlobalEnv)
Update
To keep the groups in their original order of appearance instead of sorted order, you can do:
indx1 <- factor(indx, levels=unique(indx))
split(colnames(df), indx1)
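The effect of the explicit factor levels can be checked on a small example. This sketch assumes the same indx values and column names as in the question; by_sorted and by_first_seen are hypothetical names.

```r
indx <- c("1", "1", "1", "p", "p", "p", "o1", "o1", "o1")
cols <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")

# Splitting by a character vector orders the groups by sorted value
by_sorted <- split(cols, indx)

# Splitting by a factor keeps the order of its levels (here: first appearance)
indx1 <- factor(indx, levels = unique(indx))
by_first_seen <- split(cols, indx1)

names(by_first_seen)
# [1] "1"  "p"  "o1"
```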

You can try this:
invisible(sapply(unique(indx),
function(x)
assign(paste("df",x,sep="_"),
df[,grepl(paste0("_",x,"$"),colnames(df))],
envir=.GlobalEnv)))
# For each unique element of indx, the code assigns (in the global environment)
# the columns matching that index to a new data.frame named after the index.
# invisible() prevents the data.frames from being printed on screen.
> ls()
[1] "df" "df_1" "df_o1" "df_p" "indx"
> df_1[1:3,]
c_1 d_1 e_1
1 1.8033188 0.5578494 2.2458750
2 1.0095556 -0.4042410 -0.9274981
3 0.7122638 1.4677821 0.7770603
> df_o1[1:3,]
1_o1 2_o1 3_o1
1 -2.05854176 -0.92394923 -0.4932116
2 -0.05743123 -0.24143979 1.9060076
3 0.68055653 -0.70908036 1.4514368
> df_p[1:3,]
a_p b_p c_p
1 -0.2106823 -0.1170719 2.3205184
2 -0.1826542 -0.5138504 1.9341230
3 -1.0551739 -0.2990706 0.5054421
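One caveat with the df[, ...] form: if a suffix matches exactly one column, the result is simplified to a plain vector unless drop = FALSE is given. Below is a minimal sketch of a variant that guards against this and avoids building a regular expression. It uses a small made-up data frame, and toy, suffixes and pieces are hypothetical names.

```r
toy <- data.frame(a_x = 1:3, b_y = 4:6, c_y = 7:9)
indx <- gsub(".*_", "", names(toy))
suffixes <- unique(indx)

# One data.frame per suffix; drop = FALSE keeps single-column results as data.frames
pieces <- lapply(setNames(suffixes, suffixes), function(s) {
  toy[, endsWith(names(toy), paste0("_", s)), drop = FALSE]
})

sapply(pieces, is.data.frame)
sapply(pieces, ncol)
```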

Related

Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a = list(as.numeric(c(NA,NA)), 2,as.numeric(c(3, NA)), c(4,5) )
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
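This alternative also works on this data, but you will get warnings (all() coerces the numeric values to logical), and it would also drop any row whose vector contains a 0: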
dt[sapply(dt$a, all)]
Better approach (thanks to r2evans, see comments)
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
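To see why the anyNA version selects the right rows, the mask can be computed on the list column alone. This is a base R sketch using the question's data; vapply is swapped in for sapply here only so that the result type is fixed, and keep is a hypothetical name.

```r
a <- list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14

# TRUE for list elements that contain no NA at all
keep <- !vapply(a, anyNA, logical(1))
keep
# [1] FALSE  TRUE FALSE  TRUE

# The same mask applied to the plain vector b keeps the values of rows 2 and 4
b[keep]
# [1] 12 14
```

With the question's data.table, the same logical vector could be used directly as dt[keep].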
A third option that you might prefer: move the functionality to a separate helper function that takes a list (nl) and returns a logical vector of length length(nl), then apply that function as below. In this example, I explicitly call unlist() on the result of lapply() rather than letting sapply() simplify it for me, but sapply() would work as well.
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
An alternative to *apply()
dt[, .SD[!anyNA(a, TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14

How to get a value from vector and assign it as a name of new dataframe

In R, I have a vector NAME:
[1] "ALKR50SV" "AMKR71SV" "AOKR71SV" "AZKR52SV" "BFKR70SV" "BJKR61SV" "BUKR6HSV"
"CDKR61SV" "CFKR31SV"
I want to use each element as the name of a new dataframe,
e.g. a dataframe called ALKR50SV, a dataframe called AMKR71SV, ...
A for loop like:
NAME[i] <- data1
will cause problems.
What should I do? Thank you.
As @joran and @neilfws said, it is best to work with a list of data.frames.
For example, consider the following list of three data.frames:
lst <- lapply(1:3, function(x) as.data.frame(matrix(sample(20), ncol = 4)))
You can name list elements
names(lst) <- c("ALKR50SV", "AMKR71SV", "AOKR71SV")
and operate on list elements using lapply, e.g.
lapply(lst, dim)
#$ALKR50SV
#[1] 5 4
#
#$AMKR71SV
#[1] 5 4
#
#$AOKR71SV
#[1] 5 4
You can use assign:
nms <- c('one', 'two', 'three')
for (i in seq_along(nms)) {
  assign(nms[i], i)  # creates the variables one, two and three
}
one # 1
two # 2
three # 3
But as others have commented, it is most likely better to put your dataframes into a named list.
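For the loop in the question, the named-list route might look like the sketch below. The data frame built inside the loop is a made-up placeholder for the real data1, and frames is a hypothetical name.

```r
NAME <- c("ALKR50SV", "AMKR71SV", "AOKR71SV")

# Pre-allocate a list with one named slot per element of NAME
frames <- vector("list", length(NAME))
names(frames) <- NAME

for (nm in NAME) {
  # placeholder data; in practice this would be the data frame for nm
  frames[[nm]] <- data.frame(id = nm, value = 1:3)
}

names(frames)
frames[["AMKR71SV"]]
```

Each data frame is then retrieved by name, e.g. frames$ALKR50SV, and the whole collection can be processed with lapply.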

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
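Applied to the two cases described in the question, the same idea could look like this. It is a minimal sketch: cars is a copy of the built-in mtcars, while pos, neg and features are small made-up data frames standing in for the positive and negative term matrices.

```r
# Suffix: add "_sample" to every column name of a copy of mtcars
cars <- mtcars
names(cars) <- paste0(names(cars), "_sample")
head(names(cars), 3)
# "mpg_sample" "cyl_sample" "disp_sample"

# Prefix: tag the columns of two term tables before combining them
pos <- data.frame(hello = c(1, 0), great = c(0, 1))
neg <- data.frame(hello = c(0, 2), awful = c(1, 0))
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

features <- cbind(pos, neg)
names(features)
# "positive_hello" "positive_great" "negative_hello" "negative_awful"
```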
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to prepend the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

How can I present only specific list as my output after using a condition on many other lists

Suppose I have a huge number of lists, each containing 3 rows (I present 3 of them here), and I would like to get the name (list1, list2, etc.) of the list that has the minimum values in the first and third rows out of the given 3 rows. In this case list3 is the answer (0.1948026 and 0.1125526 are the minimum values across all lists). How can I present only list3 as my output?
list1<-list(
0.3318594
,0.1296125
, 0.1262203)
list2<- list(
0.3654229
,0.1428565
,0.1552035)
list3<- list(
0.1948026
,0.1272514
,0.1125526)
data.table is probably going to be the fastest solution for this if you have lots of lists.
You could do:
library(data.table)
#add all in a list
the_lists <- list(list1, list2, list3)
Or it would probably be much better (if your lists are all in the global environment) to do the following, as per @DavidArenburg's comment:
#this will create a list with all lists in your global env
#that are named list1, list2, list3 etc.
#(the anchored pattern avoids picking up other objects such as the_lists itself)
the_lists <- mget(ls(pattern = "^list[0-9]+$"))
#create a data table out of them
#notice that every row represents a list here
all_lists <- rbindlist(the_lists)
#find the list with the minimum row
#which for this case means find the min location of each column
mins <- as.numeric(all_lists[, lapply(.SD, which.min)])
#> mins
#[1] 3 3 3
And then just use mins to retrieve the list you want.
For row 1 use:
> the_lists[mins[1]]
$list3
$list3[[1]]
[1] 0.1948026
$list3[[2]]
[1] 0.1272514
$list3[[3]]
[1] 0.1125526
and for row 3:
> the_lists[mins[3]]
$list3
$list3[[1]]
[1] 0.1948026
$list3[[2]]
[1] 0.1272514
$list3[[3]]
[1] 0.1125526
Using mget as suggested by @DavidArenburg, the list names are kept and will be shown as above.
To get the value and the names:
> data.frame(min_loc = mins[c(1,3)], names = names(the_lists)[c(mins[c(1,3)])])
min_loc names
1 3 list3
2 3 list3
Your lists are defined in your global environment and not in a list, which is a bad habit. Despite this, you can solve your problem this way:
# first catch your lists' names in your environment
lnames = Filter(function(x) class(get(x))=='list', ls(pattern="list\\d+", envir=globalenv()))
# gather values in a matrix - the column names will be the list names
m = sapply(lnames, get)
# to get the name of the list(s) with min value in 1st and 3rd position
colnames(m)[unique(apply(m[c(1,3),],1,which.min))]
#[1] "list3"
Try this:
# Collect lists
collection.list <- list("list1"=list1,"list2"=list2,"list3"=list3)
#Build data
matrix <- do.call(rbind,collection.list)
# Select columns
used.columns <- c(1,3)
# Find minimum value
min.ind <- which(matrix[,used.columns]==min(unlist(matrix[,used.columns])),arr.ind = TRUE)
# Find name
names(collection.list)[min.ind[,"row"]]
I think this should work:
common_list <- mapply(c, list1, list2, list3, SIMPLIFY=FALSE)
a <- lapply(common_list, min)
b <- paste("list", unlist(lapply(common_list, which.min)))
data.frame(Min_value = unlist(a), List = unlist(b))
# Min_value List
# 1 0.1948026 list 3
# 2 0.1272514 list 3
# 3 0.1125526 list 3
However, this gives the minimum for every row.
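To restrict the comparison to the first and third positions only, a base R sketch along the same lines could look as follows. It assumes every list holds single numeric values, as in the question; all_lists, m, rows and winners are hypothetical names.

```r
list1 <- list(0.3318594, 0.1296125, 0.1262203)
list2 <- list(0.3654229, 0.1428565, 0.1552035)
list3 <- list(0.1948026, 0.1272514, 0.1125526)
all_lists <- list(list1 = list1, list2 = list2, list3 = list3)

# Numeric matrix: one row per position, one column per list
m <- sapply(all_lists, unlist)

# For each selected position, find which list holds the smallest value
rows <- c(1, 3)
winners <- colnames(m)[apply(m[rows, , drop = FALSE], 1, which.min)]

unique(winners)
# [1] "list3"
```

If different lists win at different positions, unique(winners) returns more than one name, so the caller would still need to decide how to break such a tie.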

Subsetting data frame by vector of elements

I spent about 20 minutes looking through previous questions, but could not find what I am looking for. I have a large data frame I want to subset down based on a list of names, but the names in the data frame can also have a postfix not indicated in the list.
In other words, is there a simpler generic way (for infinite numbers of postfixes) to do the following:
data <- data.frame("name"=c("name1","name1_post1","name2","name2_post1",
"name2_post2","name3","name4"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2","name3")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
In response to @Arun's answer: the names in my data actually include more than one underscore, making the problem more complicated.
data <- data.frame("name"=c("name1_target_time","name1_target_time_post1","name2_target_time","name2_target_time_post1",
"name2_target_time_post2","name3_target_time","name4_target_time"),
"data"=rnorm(7,0,1),
stringsAsFactors=FALSE)
names <- c("name2_target_time","name3_target_time")
subset <- data[ data$name %in% names | data$name %in% paste0(names,"_post1") | data$name %in% paste0(names,"_post2") , ]
Edit: solution using regular expressions (following OP's follow-up in comment):
data[grepl(paste(names, collapse="|"), data$name), ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
On your new data:
# name data
# 3 name2_target_time 0.6295361
# 4 name2_target_time_post1 0.8951720
# 5 name2_target_time_post2 0.6602126
# 6 name3_target_time 2.2734835
Also, as @flodel shows in the comments, this also works fine!
subset(data, sub("_post\\d+$", "", name) %in% names)
Old solution:
data[sapply(strsplit(data$name, "_"), "[[", 1) %in% names, ]
# name data
# 3 name2 1.4934931
# 4 name2_post1 -1.6070809
# 5 name2_post2 -0.4157518
# 6 name3 0.4220084
The idea: first split the string at _ using strsplit. This results in a list. For example, name2 results in just name2 (a single element), whereas name2_post1 results in name2 and post1 (two elements). By wrapping it with sapply and using [[ with 1, we select just the first element of each result. Then we can use %in% to check whether it is present in names (which is straightforward).
A grep solution would probably look something like the following:
subset <- data[grep("(name2)|(name3)", data$name), ]
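An unanchored pattern also matches longer names that merely contain one of the wanted names (name2 would match name20, for instance). A minimal sketch of an anchored variant is shown below. It assumes the wanted names contain no regular expression metacharacters; the row name20_target_time is a made-up addition to illustrate the difference, and wanted and pattern are hypothetical names.

```r
data <- data.frame(
  name = c("name1_target_time", "name1_target_time_post1",
           "name2_target_time", "name2_target_time_post1",
           "name2_target_time_post2", "name3_target_time",
           "name4_target_time",
           "name20_target_time"),  # hypothetical near-miss, not in the question
  data = rnorm(8, 0, 1),
  stringsAsFactors = FALSE
)
wanted <- c("name2_target_time", "name3_target_time")

# Match a wanted name exactly, optionally followed by "_" and any postfix
pattern <- paste0("^(", paste(wanted, collapse = "|"), ")(_.*)?$")
result <- data[grepl(pattern, data$name), ]

result$name
```

This keeps rows 3 to 6 and leaves out name20_target_time, whereas the unanchored name2 pattern would have kept it.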
