Unlist a list of multiple elements in R - r

I have extracted data from Twitter, and it appears as in the List column (which you will get after running the code below). I want the output to look like the Broken column.
data <- data.frame(matrix(, nrow=4, ncol=2))
colnames(data)[1:2] <- c("List", "Broken")
data$List[1] <- 1
data$List[2] <- list(c("1", "SmythsToysUK"))
data$List[3] <- list(c("1", "FortniteGame", "CityCtrMirdif", "itpliveme"))
data$List[4] <- 1
data$Broken[1:4]<- c("SmythsToysUK","FortniteGame","CityCtrMirdif","itpliveme")

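We can remove all the numbers from the List column. Note that as.numeric() emits an "NAs introduced by coercion" warning for the non-numeric strings, which is expected here.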
temp <- unlist(data$List)
data$Broken <- temp[is.na(as.numeric(temp))]
data
# List Broken
#1 1 SmythsToysUK
#2 1, SmythsToysUK FortniteGame
#3 1, FortniteGame, CityCtrMirdif, itpliveme CityCtrMirdif
#4 1 itpliveme

We can use grep with unlist: after unlisting the list, select only the elements that contain letters.
data$Broken <- grep("[A-Za-z]", unlist(data$List), value = TRUE)
data$Broken
#[1] "SmythsToysUK" "FortniteGame" "CityCtrMirdif" "itpliveme"
Another option is to remove the first element of each entry, which appears to be an index, and then unlist:
unlist(sapply(data$List, `[`, -1))
NOTE: Neither of these two options produces any warnings.
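As a rough, self-contained sketch of the same idea (the plain list lst below is a stand-in for data$List, and the variable names flat and handles are made up for illustration), one could also drop every element that consists only of digits:

```r
# Stand-in for data$List: a mix of bare numbers and character vectors
lst <- list(1,
            c("1", "SmythsToysUK"),
            c("1", "FortniteGame", "CityCtrMirdif", "itpliveme"),
            1)

# unlist() coerces everything to character, so the numbers become "1"
flat <- unlist(lst)

# Keep only the elements that are not made up purely of digits
handles <- flat[!grepl("^[0-9]+$", flat)]
print(handles)
```

This avoids as.numeric() entirely, so no coercion warning is raised. It assumes that no real handle consists only of digits.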

Related

Separating a column by the first 3 characters

I have a set of data below and I would like to separate the first three characters from the bm_id column into a separate column with the rest of the characters in another column.
bm_id
1
popCL20TE
2
agrST20
3
agrST20-09SE
I have tried solutions to a similar question on Stack Overflow; however, I end up with extra empty columns while my data remains unsplit.
bm_id[c('species', 'id')] <- tstrsplit(bm_id$bm_id, '(?<=.{3})', perl = TRUE)
The same happens with this code:
bm_id2 <- tidyr::separate(bm_id, bm_id, into = c("species", "id"), sep = 3)
How about substr?
df <- data.frame(vec= c("popCL20TE", "agrST20"))
df$first3 <- substr(df$vec, 1, 3)
df$last <- substr(df$vec, 4, nchar(df$vec))
df
vec first3 last
1 popCL20TE pop CL20TE
2 agrST20 agr ST20
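A small variation on the above, as a sketch only: substring() defaults its end position to the end of the string, so the nchar() call can be dropped. The column names species and id are taken from the question; the vector ids is just the sample data.

```r
# Sample ids from the question
ids <- c("popCL20TE", "agrST20", "agrST20-09SE")

parts <- data.frame(
  bm_id   = ids,
  species = substr(ids, 1, 3),   # first three characters
  id      = substring(ids, 4),   # everything from the 4th character on
  stringsAsFactors = FALSE
)
print(parts)
```

Both substr and substring are vectorised over their first argument, so no loop is needed.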

How to use grepl function multiple times, in R

I have a vector like go_id and a data.frame like data.
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))
How can I keep the rows of data for which the bio_process cell contains at least one of the go_id elements? Note that a GO code cannot be repeated within the same bio_process cell.
To be more precise, I would like to get only the first, third, and sixth rows of the data.frame.
I have tried a for loop using 'grepl' function, like this:
go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
}
I know this cannot work, because I cannot insert a variable's value into a regular expression like this.
Any ideas on this?
Thank you
We can use Reduce with grepl
data$ind <- Reduce(`|`, lapply(go_id, function(pat)
grepl(pat, data$bio_process, fixed = TRUE)))
data
# protein_id bio_process ind
#1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932] TRUE
#2 P26371 [GO:0005829]; [GO:0008544] FALSE
#3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765] TRUE
#4 P60372 NA FALSE
#5 O75526 [GO:0000398]; [GO:0003729] FALSE
#6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714] TRUE
You should use fixed = TRUE in grepl():
vect <- rep(FALSE, nrow(data))
for(id in go_id){
vect <- vect | grepl(id, data$bio_process, fixed = TRUE)
}
data[vect,]
You can subset using str_extract to define the pattern on those substrings that are distinctive:
library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"), data$bio_process),]
protein_id bio_process
1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932]
3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765]
6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
EDIT:
The most straightforward solution is subsetting with grepl and paste0, which adds the escape backslashes for the metacharacter [:
data[grepl(paste0("\\", go_id, collapse="|"), data$bio_process),]
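For comparison, here is a sketch of one more way to combine several fixed-string matches without building a regular expression at all. It assumes the original go_id vector (with the brackets and "GO:" prefix intact); the names hits, keep, and result are made up for this example.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
data <- data.frame(
  protein_id  = c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130"),
  bio_process = c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                  "[GO:0005829]; [GO:0008544]",
                  "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                  "NA",
                  "[GO:0000398]; [GO:0003729]",
                  "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]"),
  stringsAsFactors = FALSE
)

# One logical column per GO id; fixed = TRUE treats "[" and "]" literally
hits <- sapply(go_id, grepl, x = data$bio_process, fixed = TRUE)

# Keep a row if any of the ids matched
keep <- rowSums(hits) > 0
result <- data[keep, ]
print(result$protein_id)
```

The hits matrix is also handy if you later need to know which id matched which row, not just whether any of them did.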

Get indexes from a vector with multiple elements based on a specific value

I have a vector:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
Each element contains two numbers separated by ",". I would like to get the indexes of the elements in which one of the numbers is "1".
The expected indexes are:
1, 6, 7, 9, 10
grep() will work nicely for this. By default, it returns the indices of the matched pattern.
grep("^1,|,1$", lst)
# [1] 1 6 7 9 10
The regular expression ^1,|,1$ looks to match a string that
^1, = starts with 1,
| OR
,1$ = ends with ,1
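To see why the anchors matter, here is a short sketch comparing the anchored pattern with a naive one and with a split-based check. The names naive, anchored, and has_one are made up for this example.

```r
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")

# Naive pattern: matches any element containing the character "1",
# so "7,10", "11,0" and "10,0" are picked up as well
naive <- grep("1", lst)

# Anchored pattern: only a whole field equal to 1 matches
anchored <- grep("^1,|,1$", lst)

# Same result without a regex: split on the comma and test the fields
has_one <- vapply(strsplit(lst, ",", fixed = TRUE),
                  function(p) "1" %in% p, logical(1))

print(naive)
print(anchored)
print(which(has_one))
```

The split-based version is a bit longer, but it generalises more easily if the elements ever hold more than two numbers.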
Each element contains two numbers. My answer is not ideal, but I got what I need:
m <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",1)))
n <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",2)))
sort(unique(c(which(m==1),which(n==1))))
Depending on the background and context of this task, it might be prudent to turn this vector into a data.frame:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
DF <- read.table(text = do.call(paste, list(lst, collapse = "\n")), sep = ",")
which(DF$V1 == 1L | DF$V2 == 1L)
#[1] 1 6 7 9 10

Splitting a dataframe by column name indices

This is a variation of an earlier question.
df <- data.frame(matrix(rnorm(9*9), ncol=9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
I want to split the dataframe by the index that is given in the column names after the underscore "_". (The indices can be any characters/numbers of different lengths; these are just random examples.)
indx <- gsub(".*_", "", names(df))
I also want to name the resulting dataframes accordingly. In the end I would like to get three dataframes, called:
df_1
df_p
df_o1
Thank you!
Here, you can split the column names by indx, get the subset of data within the list using lapply and [, set the names of the list elements using setNames, and use list2env if you need them as individual datasets (not so recommended, as most of the operations can be done within the list, and later, if you want, it can be saved using write.table with lapply).
list2env(
setNames(
lapply(split(colnames(df), indx), function(x) df[x]),
paste('df', sort(unique(indx)), sep="_")),
envir=.GlobalEnv)
head(df_1,2)
# c_1 d_1 e_1
#1 1.0085829 -0.7219199 0.3502958
#2 -0.9069805 -0.7043354 -1.1974415
head(df_o1,2)
# 1_o1 2_o1 3_o1
#1 0.7924930 0.434396 1.7388130
#2 0.9202404 -2.079311 -0.6567794
head(df_p,2)
# a_p b_p c_p
#1 -0.12392272 -1.183582 0.8176486
#2 0.06330595 -0.659597 -0.6350215
Or use Map. This is similar to the above approach, i.e., split the column names by indx and use [ to extract the columns; the rest is as above.
list2env(setNames(Map(`[` ,
list(df), split(colnames(df), indx)),
paste('df',unique(sort(indx)), sep="_")), envir=.GlobalEnv)
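Update
To keep the groups in their original order of appearance rather than in sorted order, convert indx to a factor with explicit levels before splitting: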
indx1 <- factor(indx, levels=unique(indx))
split(colnames(df), indx1)
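Putting the pieces together, here is a sketch of the list-based variant that skips list2env altogether. The names groups and parts are made up for this example, and set.seed is only there to make the random data reproducible.

```r
set.seed(1)
df <- data.frame(matrix(rnorm(9 * 9), ncol = 9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")

# Suffix after the underscore
indx <- sub(".*_", "", names(df))

# Group the column names, keeping the groups in order of first appearance
groups <- split(names(df), factor(indx, levels = unique(indx)))

# One data.frame per group, stored in a named list
parts <- lapply(groups, function(cols) df[cols])
names(parts) <- paste0("df_", names(parts))

print(names(parts))
print(head(parts$df_o1, 2))
```

Each piece is then available as parts$df_1, parts$df_p, and parts$df_o1, and anything that should happen to all of them can go through lapply.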
You can try this:
invisible(sapply(unique(indx),
function(x)
assign(paste("df",x,sep="_"),
df[,grepl(paste0("_",x,"$"),colnames(df))],
envir=.GlobalEnv)))
# For each unique element of indx, the code assigns (in the global environment)
# the columns whose names end in that index to a new data.frame named after the index.
# invisible() prevents the data.frames from being printed on screen.
> ls()
[1] "df" "df_1" "df_o1" "df_p" "indx"
> df_1[1:3,]
c_1 d_1 e_1
1 1.8033188 0.5578494 2.2458750
2 1.0095556 -0.4042410 -0.9274981
3 0.7122638 1.4677821 0.7770603
> df_o1[1:3,]
1_o1 2_o1 3_o1
1 -2.05854176 -0.92394923 -0.4932116
2 -0.05743123 -0.24143979 1.9060076
3 0.68055653 -0.70908036 1.4514368
> df_p[1:3,]
a_p b_p c_p
1 -0.2106823 -0.1170719 2.3205184
2 -0.1826542 -0.5138504 1.9341230
3 -1.0551739 -0.2990706 0.5054421

Appending list to data frame in R

I have created an empty data frame in R with two columns:
d<-data.frame(id=c(), numobs=c())
I would like to append to this data frame (in a loop) a list, d2, which prints as:
[1] 1 100
I tried using rbind:
d<-rbind(d, d2)
and merge:
d<-merge(d, d2)
And I even tried just making a list of lists and then converting it to a data frame, and then giving that data frame names:
d<-rbind(dlist1, dlist2)
dframe<-data.frame(d)
names(dframe)<-c("id","numobs")
But none of these seem to meet the standards of a routine checker (this is for a class), which gives the error:
Error: all(names(cc) %in% c("id", "nobs")) is not TRUE
Even though it works fine in my workspace.
This is frustrating since the error does not reveal where the error is occurring.
Can anyone help me to either merge 2 data frames or append a data frame with a list?
I think you are confusing the purposes of rbind and merge. rbind appends data.frames, named lists, or both vertically (as rows), while merge combines data.frames horizontally (joining on common columns).
You also seem to be confusing vectors and lists. In R, a list can hold a different data type in each element, while a vector must have all elements of the same type. Both are one-dimensional. When you use rbind, you want to append a named list, not a named or unnamed vector.
Unnamed Vectors and Lists
The way you define a vector is with the c() function. The way you define an unnamed list is with the list() function, like so:
vec1 = c(1, 10)
# > vec1
# [1] 1 10
list1 = list(1, 10)
# > list1
# [[1]]
# [1] 1
#
# [[2]]
# [1] 10
Notice that both vec1 and list1 have two elements, but list1 stores the two numbers as two separate vectors (element [[1]] is the vector c(1) and [[2]] is the vector c(10)).
Named Vectors and Lists
You can also create named vectors and lists. You do this by:
vec2 = c(id = 1, numobs = 10)
# > vec2
# id numobs
# 1 10
list2 = list(id = 1, numobs = 10)
# > list2
# $id
# [1] 1
#
# $numobs
# [1] 10
Same data structure for both, but the elements are named.
Dataframes as Lists
Notice that list2 has a $ in front of each element name. This is a clue that data.frames are actually lists in which each column is an element, since df$column is often used to extract a column from a dataframe. This makes sense, since both lists and data.frames can hold different data types, unlike vectors.
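A quick way to check this for yourself, as a small sketch (the object names df and cols are only for illustration):

```r
df <- data.frame(id = 1, numobs = 10)

# A data.frame is a list under the hood
print(is.list(df))

# $ and [[ ]] extract a column exactly as they extract a list element
print(identical(df$id, df[["id"]]))

# Dropping the data.frame class leaves a plain named list of columns
cols <- as.list(df)
print(names(cols))

# Functions that work on lists therefore work column by column
print(vapply(df, class, character(1)))
```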
The rbind function
When your first element is a dataframe, rbind requires that what you are appending has the same names as the columns of the dataframe. Now, a named vector would not work, because the elements of a vector are not treated as columns of a dataframe, whereas a named list matches elements with columns if the names are the same:
To demonstrate:
d<-data.frame(id=c(), numobs=c())
rbind(d, c(1, 10))
# X1 X10
# 1 1 10
rbind(d, c(id = 1, numobs = 10))
# X1 X10
# 1 1 10
rbind(d, list(1, 10))
# X1 X10
# 1 1 10
rbind(d, list(id = 1, numobs = 10))
# id numobs
# 1 1 10
Knowing the above, you can also rbind two dataframes whose column names match:
df2 = data.frame(id = 1, numobs = 10)
rbind(d, df2)
# id numobs
# 1 1 10
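Since the question is about appending inside a loop, here is a sketch of one common pattern: collect one-row data.frames in a list and bind them once at the end. The loop bounds and the i * 100 values are invented for illustration and are not from the original assignment.

```r
# Pre-allocate a list to hold one row per iteration
rows <- vector("list", 3)

for (i in 1:3) {
  # Hypothetical values: replace i * 100 with the real observation count
  rows[[i]] <- data.frame(id = i, numobs = i * 100)
}

# Bind all rows in one go; the column names come from the pieces
d <- do.call(rbind, rows)
print(d)
```

Because every piece already carries the column names, the result has the right names without a separate names(d) <- ... step, and the data.frame is not copied again on every iteration.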
For starters, the routine checker appears to be looking for columns labeled "id" and "nobs". If that doesn't match your file output, you'll get that error.
I'm taking what is probably the same class and had the same error; correcting my column names made that go away (I'd labeled the 2nd one "nob" not "nobs"!) Now I've gotten the routine checker to complete correctly, or so it seems... but it outputs three data files, and the first and last files are correct but the second one yields "Sorry, that is incorrect." No further feedback. Maddening!
No point posting my code here as it runs fine locally with all the course examples, and it's kinda hard to debug when you don't know what the script is asking for. Sigh.
That d2 object is being printed as an atomic vector would be. Maybe if you showed us either dput(d2) or str(d2), you would have a better understanding of R lists. Furthermore, that first bit of code does not produce a two-column dataframe either.
> d<-data.frame(id=1, numobs=1)[0, ] # 2-cl dataframe with 0 rows
> dput(d)
structure(list(id = numeric(0), numobs = numeric(0)), .Names = c("id",
"numobs"), row.names = integer(0), class = "data.frame")
> d2 <- list(id="fifty three", numobs=6) # names that match names(d)
> rbind(d,d2)
id numobs
2 fifty three 6
