I am having a difficult time subsetting a list.
For example,
test <- data.frame(x = c("5353-66", "55-110-4000","6524-533", "62410-165", "653-520-2410"))
test$x <- as.character(test$x)
strsplit(test$x, "-")
strsplit gives me a list as below:
[[1]]
[1] "5353" "66"
[[2]]
[1] "55" "110" "4000"
[[3]]
[1] "6524" "533"
[[4]]
[1] "62410" "165"
[[5]]
[1] "653" "520" "2410"
When I run lapply(strsplit(test$x, "-"), "[[", 1), it gives me the first character string from each component of the list as below:
[[1]]
[1] "5353"
[[2]]
[1] "55"
[[3]]
[1] "6524"
[[4]]
[1] "62410"
[[5]]
[1] "653"
Then... How do I select entire [[1]] and [[2]] and [[3]]... separately?
For example, I want to assign test$y[1] as c("5353", "66") and test$y[2] as c("55" , "110" , "4000") and so on.
test$y <- lapply(strsplit(test$x, "-"), "[", 1)
The line above gave me the same result.
While it can get messy, it's also fairly easy to do. You were on the right track: calling strsplit() on each element inside lapply() and wrapping the result in unlist() will get you what you want.
test$y <- lapply(1:length(test$x),function(i) unlist(strsplit(test$x[[i]],"-")))
test$y[[1]]
[1] "5353" "66"
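As a side note, strsplit() already returns one character vector per input string, so the list can probably be assigned to the column directly. A minimal sketch, assuming the column is a character vector (stringsAsFactors = FALSE is set explicitly here so the example does not depend on the R version):

```r
test <- data.frame(x = c("5353-66", "55-110-4000", "6524-533"),
                   stringsAsFactors = FALSE)

# strsplit() returns a list with one element per row, which can be
# stored as a list column because its length equals nrow(test)
test$y <- strsplit(test$x, "-")

test$y[[1]]  # the whole first component
test$y[[2]]  # the whole second component
```

Double brackets return the component itself, while single brackets (test$y[1]) return a list of length one.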
This is where the magic of sapply comes in handy:
test <- data.frame(x = c("5353-66", "55-110-4000","6524-533", "62410-165", "653-520-2410"))
test$x <- as.character(test$x)
sapply(test$x,strsplit,'-')
$`5353-66`
[1] "5353" "66"
$`55-110-4000`
[1] "55" "110" "4000"
$`6524-533`
[1] "6524" "533"
$`62410-165`
[1] "62410" "165"
$`653-520-2410`
[1] "653" "520" "2410"
What you do with the data from here is up to you. Because your data is ragged, i.e., it will not fit into a rectangular matrix or data frame that needs a fixed number of cells per row, you should keep the data as a list. Data frames in fact are lists, so many of the data frame functions work on them as well.
If you must have a data frame, you can add on NAs for missing cells and then convert it back to a data frame in wide format:
out_list <- sapply(test$x,strsplit,'-')
max_length <- max(sapply(out_list,length))
out_list <- lapply(out_list, function(x) {
if(length(x)<max_length) {
x <- c(x,rep(NA,times=max_length-length(x)))
}
return(x)
})
out_data <- as.data.frame(out_list)
  X5353.66 X55.110.4000 X6524.533 X62410.165 X653.520.2410
1     5353           55      6524      62410           653
2       66          110       533        165           520
3     <NA>         4000      <NA>       <NA>          2410
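The padding step can likely be shortened with the replacement function `length<-`, which extends a vector with NA when the new length is larger. This is only a sketch of the same idea on a smaller input; the names parts, padded and wide are made up for the illustration, and the result is laid out one row per input string rather than one column:

```r
x <- c("5353-66", "55-110-4000", "6524-533")
parts <- strsplit(x, "-")

# longest component decides the number of columns
n <- max(lengths(parts))

# `length<-`(v, n) pads v with NA up to length n
padded <- lapply(parts, `length<-`, n)

# one row per original string, NA where a piece is missing
wide <- do.call(rbind, padded)
wide
```

If a data frame is needed, as.data.frame(wide) converts the character matrix.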
I have a list of data:
$nPerm
[1] "1000"
$minGSSize
[1] "10"
$maxGSSize
[1] "100"
$by
[1] "DOSE"
$seed
[1] "TRUE"
This list is supposed to be flexible, so these values could be different and could be something else.
Everything in this list is of class character, numbers and words alike. I would like to know if it is possible to convert only the numbers to numeric, but leave the others as characters/strings.
Thank you in advance!
L <- list(a="1000", b="DOSE", c="99")
type.convert(L, as.is = TRUE)
# $a
# [1] 1000
# $b
# [1] "DOSE"
# $c
# [1] 99
Evan's answer is very neat, just for completeness also a {purrr} option:
L <- list(a="1000", b="DOSE", c="99")
L |> purrr::map(~ifelse(stringr::str_detect(.x,"^[:digit:]+$"), as.numeric(.x), .x))
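One thing to watch with the original list: type.convert() also turns the string "TRUE" into a logical, and the regex above only matches non-negative integers. If only numbers should change, a base R sketch along these lines may be closer to what was asked. The helper name to_num_if_possible is made up for this example, and it assumes each list element is a single string:

```r
params <- list(nPerm = "1000", minGSSize = "10", by = "DOSE",
               seed = "TRUE", cutoff = "0.05")

# hypothetical helper: try as.numeric(), keep the original string
# when the conversion fails (as.numeric() returns NA with a warning)
to_num_if_possible <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  if (is.na(num)) x else num
}

converted <- lapply(params, to_num_if_possible)
str(converted)
```

Here "TRUE" and "DOSE" stay character, while "1000", "10" and "0.05" become numeric.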
I'm working on speaking turns in conversation. My interest is in the words that get repeated from a prior turn to a next turn:
turnsX <- data.frame(
speaker = c("A","B","A","B"),
speech = c("let's have a look",
"yeah let's take a look",
"yeah okay so where to start",
"let's start here"), stringsAsFactors = FALSE
)
I want to extract the repeated word forms. To this end I've run a for loop, iteratively defining each speech turn as a regex pattern for the next speech turn and str_extracting the words that get repeated from turn to turn:
library(stringr)
pattern <- c()
extracted <- c()
for(i in 1:nrow(turnsX)){
pattern[i] <- paste0(unlist(str_split(turnsX$speech[i], " ")), collapse = "|")
extracted[i+1] <- str_extract_all(turnsX$speech[i+1], pattern[i])
}
The result however is partly incorrect:
extracted
[[1]]
NULL
[[2]]
[1] "a" "let's" "a" "a" "look"
[[3]]
[1] "yeah" "a" "a"
[[4]]
[1] "start"
[[5]]
[1] NA
The correct result should be:
extracted
[[1]]
NULL
[[2]]
[1] "let's" "a" "look"
[[3]]
[1] "yeah"
[[4]]
[1] "start"
Where's the mistake? How can the code be mended, or what other approach is there, to get the correct result?
Maybe you can use Map and %in%.
x <- strsplit(turnsX$speech, " ")
Map(function(y,z) y[y %in% z], x[-length(x)], x[-1])
#[[1]]
#[1] "let's" "a" "look"
#
#[[2]]
#[1] "yeah"
#
#[[3]]
#[1] "start"
Here's a base R approach using Map:
tmp <- strsplit(turnsX$speech, ' ')
c(NA, Map(intersect, tmp[-1], tmp[-length(tmp)]))
#[[1]]
#[1] NA
#[[2]]
#[1] "let's" "a" "look"
#[[3]]
#[1] "yeah"
#[[4]]
#[1] "start"
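The two Map() answers differ in one detail that the sample data does not exercise: %in% keeps every occurrence, while intersect() returns each shared word once. A small sketch with made-up turns (prior and nxt are illustrative names, not from the question):

```r
prior <- c("a", "look", "a")      # words of the earlier turn
nxt   <- c("take", "a", "look")   # words of the following turn

# subsetting with %in% keeps duplicates from the earlier turn
kept_in <- prior[prior %in% nxt]

# intersect() drops duplicates
kept_set <- intersect(prior, nxt)

kept_in
kept_set
```

Which behaviour is right depends on whether repeated tokens within a turn should be counted more than once.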
You need word boundaries ("\\b") in the pattern, so that a short word such as "a" no longer matches inside longer words:
library(stringr)
pattern <- c()
extracted <- c()
for(i in 2:nrow(turnsX)){
pattern[i - 1] <- paste0(unlist(str_split(turnsX$speech[i - 1], " ")), collapse = "|\\b")
extracted[i] <- str_extract_all(turnsX$speech[i], pattern[i - 1])
}
# [[1]]
# NULL
#
# [[2]]
# [1] "let's" "a" "look"
#
# [[3]]
# [1] "yeah"
#
# [[4]]
# [1] "start"
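Note that collapsing with "|\\b" only puts a boundary in front of the second and later words. A possibly more robust variant wraps the whole alternation in "\\b(...)\\b". The sketch below uses base R (regmatches and gregexpr) instead of stringr, and it assumes the words contain no regex metacharacters that would need escaping:

```r
turns <- c("let's have a look",
           "yeah let's take a look",
           "yeah okay so where to start",
           "let's start here")

repeated <- vector("list", length(turns))
for (i in seq_along(turns)[-1]) {
  # words of the previous turn become the alternatives
  words <- unique(strsplit(turns[i - 1], " ", fixed = TRUE)[[1]])
  # boundaries on both sides, so only whole words match
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  repeated[[i]] <- regmatches(turns[i],
                              gregexpr(pattern, turns[i], perl = TRUE))[[1]]
}
repeated
```

The first element stays NULL because the first turn has no predecessor.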
I have a vector of character strings (v1) like so:
> head(v1)
[1] "do_i_need_to_even_say_it_do_i_well_here_i_go_anyways_chris_cornell_in_chicago_tonight"
[2] "going_to_see_harry_sunday_happiness"
[3] "this_motha_fucka_stay_solid_foh_with_your_naieve_ass_mentality_your_synapsis_are_lacking_read_a_fucking_book_for_christ_sake"
[4] "why_twitter_will_soon_become_obsolete_http_www.imediaconnection.com_content_23465_asp"
[5] "like_i_said_my_back_still_fucking_hurts_and_im_going_to_complain_about_it_like_no_ones_business_http_tumblr.com_x6n25amd5"
[6] "my_picture_with_kris_karmada_is_gone_forever_its_not_in_my_comments_on_my_mysapce_or_on_my_http_tumblr.com_xzg1wy4jj"
And another vector of character strings (v2) like so:
> head(v2)
[1] "here_i_go" "going" "naieve_ass" "your_synapsis" "my_picture_with" "roll"
What is the quickest way that I can return a list of vectors where each list item represents each vector item in v1 and each vector item is a regular expression match where an item in v2 appeared in that v1 item, like so:
[[1]]
[1] "here_i_go"
[[2]]
[1] "going"
[[3]]
[1] "naieve_ass" "your_synapsis"
[[4]]
[[5]]
[1] "going"
[[6]]
[1] "my_picture_with"
I'd like to leave another option with stri_extract_all_regex() in the stringi package. You can create your regular expression directly from v2 and use it in pattern.
library(stringi)
stri_extract_all_regex(str = v1, pattern = paste(v2, collapse = "|"))
[[1]]
[1] "here_i_go"
[[2]]
[1] "going"
[[3]]
[1] "naieve_ass" "your_synapsis"
[[4]]
[1] NA
[[5]]
[1] "going"
[[6]]
[1] "my_picture_with"
If you want speed, I'd use stringi. You don't seem to have any regex, just fixed patterns, so we can use a fixed stri_extract, and (since you don't mention what to do with multiple matches) I'll assume only extracting the first match is fine, giving us a little more speed with stri_extract_first_fixed.
It's probably not worth benchmarking on such a small example, but this should be quite fast.
library(stringi)
matches = lapply(v1, stri_extract_first_fixed, v2)
lapply(matches, function(x) x[!is.na(x)])
# [[1]]
# [1] "here_i_go"
#
# [[2]]
# [1] "going"
#
# [[3]]
# [1] "naieve_ass" "your_synapsis"
#
# [[4]]
# character(0)
#
# [[5]]
# [1] "going"
#
# [[6]]
# [1] "my_picture_with"
Thanks for sharing data, but next time please share it copy/pasteably. dput is nice for that. Here's a copy/pasteable input:
v1 = c(
"do_i_need_to_even_say_it_do_i_well_here_i_go_anyways_chris_cornell_in_chicago_tonight" ,
"going_to_see_harry_sunday_happiness" ,
"this_motha_fucka_stay_solid_foh_with_your_naieve_ass_mentality_your_synapsis_are_lacking_read_a_fucking_book_for_christ_sake",
"why_twitter_will_soon_become_obsolete_http_www.imediaconnection.com_content_23465_asp" ,
"like_i_said_my_back_still_fucking_hurts_and_im_going_to_complain_about_it_like_no_ones_business_http_tumblr.com_x6n25amd5" ,
"my_picture_with_kris_karmada_is_gone_forever_its_not_in_my_comments_on_my_mysapce_or_on_my_http_tumblr.com_xzg1wy4jj")
v2 = c("here_i_go", "going", "naieve_ass", "your_synapsis", "my_picture_with", "roll" )
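If stringi is not available, the same fixed-pattern idea can be sketched in base R with grepl(fixed = TRUE). This is an illustration on shortened, made-up inputs rather than a benchmarked replacement; it returns each pattern at most once per string:

```r
v1 <- c("well_here_i_go_anyways",
        "your_naieve_ass_mentality_your_synapsis",
        "nothing_matches")
v2 <- c("here_i_go", "going", "naieve_ass", "your_synapsis")

found <- lapply(v1, function(s) {
  # test every pattern as a literal substring of s
  hit <- vapply(v2, function(p) grepl(p, s, fixed = TRUE), logical(1))
  v2[hit]
})
found
```

Strings without any match yield character(0), as in the stringi version above.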
I have 2 lists. I want to check whether each element of the first list is in the second list and, if so, paste the letters "a", "b", ... onto the matching elements of the first list.
list1 <- list("Year","Age","Enrollment","SES","BOE")
list2 <- list("Year","Enrollment","SES")
I tried using lapply:
text <- letters[1:length(list2)]
listText<- lapply(list1,function(i) ifelse(i %in% list2,paste(i,text[i],sep="^"),i))
But I got the wrong output:
> listText
[[1]]
[1] "Year^NA"
[[2]]
[1] "Age"
[[3]]
[1] "Enrollment^NA"
[[4]]
[1] "SES^NA"
[[5]]
[1] "BOE"
This is the output I want:
[[1]]
[1] "Year^a"
[[2]]
[1] "Age"
[[3]]
[1] "Enrollment^b"
[[4]]
[1] "SES^c"
[[5]]
[1] "BOE"
We can use match to find the indices, then use them to subset the first list and paste on the letters:
i1 <- match(unlist(list2), unlist(list1))
list1[i1] <- paste(list1[i1], letters[seq(length(i1))], sep="^")
You just need to make text a named vector, so that text[i] can look up each letter by name:
text <- as.character(letters[1:length(list2)])
names(text) <- unlist(list2)
After re-running the lapply() call, the result is:
> listText
[[1]]
[1] "Year^a"
[[2]]
[1] "Age"
[[3]]
[1] "Enrollment^b"
[[4]]
[1] "SES^c"
[[5]]
[1] "BOE"
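Put together, the named-vector fix could look like the sketch below. It uses setNames() and a plain if/else instead of ifelse(), which is a stylistic choice of this example and not part of the answer above:

```r
list1 <- list("Year", "Age", "Enrollment", "SES", "BOE")
list2 <- list("Year", "Enrollment", "SES")

# letters named by the elements of list2, so lookup by name works;
# indexing an unnamed vector by a string is what produced NA before
text <- setNames(letters[seq_along(list2)], unlist(list2))

listText <- lapply(list1, function(i) {
  if (i %in% list2) paste(i, text[[i]], sep = "^") else i
})
unlist(listText)
```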
I have a data frame consisting of records like the following. A typical row of the data frame, df[1,], looks as follows:
84745,"F",70,7,"Single",2,"N",4,9,1,1,3,4,4,"2 day","<120 and <80",0,8,0,1,1,1,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1
I want to convert it into a variable like myvar below which is of the following type
myvar = list( list(84745,"F",70,7,"Single",2,"N",4,9,1,1,3,4,4,"2 day","<120 and <80",0,8,0,1,1,1,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1))
I have tried doing the following, but it doesn't work to convert it to a list of lists.
myvar <- as.list(as.list(as.data.frame(t(df[1,]))))
How can I do that?
EDIT: I have tried myvar = list(unclass(df[14,])). It however fails the call since the formatting of the output myvar is slightly different.
Format of the original line of code
[[1]]
[[1]][[1]]
[1] 21408
[[1]][[2]]
[1] "M"
[[1]][[3]]
[1] 69
[[1]][[4]]
[1] 3
[[1]][[5]]
[1] "Widowed"
Format of myvar = list(unclass(df[14,]))
[[1]]
[[1]]$ID
[1] "21408"
[[1]]$GenderCD
[1] "M"
[[1]]$Age
[1] "69"
[[1]]$LOS
[1] "3"
[[1]]$MaritalStatus
[1] "Widowed"
Try this:
myvar <- list( unclass( df[1,] ) )
Explanation: df[1,] is actually still a list, but with a "data.frame" class attribute. If you remove its class, it's just an ordinary list. When you applied t(df[1,]), you forced that row to become a column vector, which in a data frame needs to all be of the same class, so coercion occurred.
If the goal is a row-by-row solution then do this:
myvar <- list()
for (i in seq(nrow(df)) ) { myvar[[i]] <- unclass( df[i,] )}
If it also needs to be unnamed, which I rather doubt but suppose is possible, then:
myvar <- list()
for (i in seq(nrow(df)) ) { myvar[[i]] <- unname( unclass( df[i,] )) }
I tested the unname strategy with:
> unname(unclass( data.frame(a=345,b="tyt")[1,]))
[[1]]
[1] 345
[[2]]
[1] tyt
Levels: tyt
attr(,"row.names")
[1] 1
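The test output above still carries a "row.names" attribute and a factor column. If a bare list of plain values is wanted, one option is as.list(), which drops the row names of a data frame, combined with stringsAsFactors = FALSE. A sketch on a made-up two-row data frame (the column names a and b are only for illustration):

```r
df <- data.frame(a = c(345, 12), b = c("tyt", "abc"),
                 stringsAsFactors = FALSE)

# one unnamed list per row; as.list() removes the data.frame class
# and the row.names attribute, unname() removes the column names
myvar <- lapply(seq_len(nrow(df)), function(i) unname(as.list(df[i, ])))

myvar[[1]]
```

Each element of myvar is then a plain list whose entries keep their original types.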