Hello, I would like to select rows from a data frame in the form of a list. Here is my data frame:
df2 <- data.frame("user_id" = 1:2, "username" = c(215,154), "password" = c("John4","Dora4"))
With this data frame I can select only one column and view its values as a vector, which I did with this code:
df2[["user_id"]]
The output is:
[1] 1 2
But when I try this with more columns, I get a "subscript out of bounds" error. What is the problem here?
df2[["user_id", "username"]]
How can I resolve this and get the rows as a list?
If I understood your question correctly, you need to familiarize yourself with subsetting in R. These are ways to select multiple columns in R:
df2[,c('user_id', 'username')]
or
df2[,1:2]
If you want to return all columns as a list, you can use something like this:
lapply(1:ncol(df2), function(x) df2[,x])
The format is df2['rows','columns'], so you should use:
df2[,c("user_id", "username")]
To get them 'in the form of a list', do:
as.list(df2[,c("user_id", "username")])
The double bracket [[ notation is used to select a single element (in this case a single column, since data frames are essentially lists of columns).
See this answer for more on double vs. single bracket notation: https://stackoverflow.com/a/1169495/8444966
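To make the difference concrete, here is a minimal sketch using the df2 from the question:

```r
df2 <- data.frame(user_id = 1:2, username = c(215, 154),
                  password = c("John4", "Dora4"))

# Double brackets extract a single column as a bare vector
df2[["user_id"]]
# [1] 1 2

# Single brackets subset the data frame, and accept several columns
df2[c("user_id", "username")]
#   user_id username
# 1       1      215
# 2       2      154
```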
This should give you a list of rows (there is bound to be an existing answer for this somewhere).
row_list<- as.list(as.data.frame(t(df2[c("user_id", "username")])))
#$V1
#[1] 1 215
#$V2
#[1] 2 154
If you want to keep the names of the rows:
df2_subset <- df2[c("user_id", "username")]
setNames(split(df2_subset, seq(nrow(df2_subset))), rownames(df2_subset))
#$`1`
# user_id username
#1 1 215
#$`2`
# user_id username
#2 2 154
I have a column with values like "xxxxTxxx" or "xxTxx", always separated by 'T'. I want to extract the first part of the string, that is, the part prior to 'T', and save it in another column.
a <- c("abcT123","Dsds1Tdf4")
I want to get a table with 3 columns as below:
a b c
abcT123 abc 123
Dsds1Tdf4 Dsds1 df4
Can you please help?
Try
cbind(a,do.call(rbind,strsplit(a,"T")))
Result:
a
[1,] "abcT123" "abc" "123"
[2,] "Dsds1Tdf4" "Dsds1" "df4"
Look at ?strsplit.
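Note that strsplit() splits at every 'T', so a string containing more than one 'T' would produce more than two pieces. If your real data can contain multiple 'T's and you only want to split at the first one, here is a sub()-based sketch (base R only):

```r
a <- c("abcT123", "Dsds1Tdf4")
b <- sub("T.*", "", a)       # everything before the first "T"
c <- sub("^[^T]*T", "", a)   # everything after the first "T"
data.frame(a, b, c)
# b: "abc"  "Dsds1"
# c: "123"  "df4"
```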
I have one list of vectors of people's names, where each vector just has the first and last name and I have another list of vectors, where each vector has the first, middle, last names. I need to match the two lists to find people who are included in both lists. Because the names are not in order (some vectors have the first name as the first value, while others have the last name as the first value), I would like to match the two vectors by finding which vector in the second list (full name) contains all the values of a vector in the first list (first and last names only).
What I have done so far:
#reproducible example
first_last_names_list <- list(c("boy", "boy"),
c("bob", "orengo"),
c("kalonzo", "musyoka"),
c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
c("stephen", "kalonzo", "musyoka"),
c("james", "bob", "orengo"),
c("lisamula", "silverse", "anami"))
First, I tried to make a function that checks whether one vector is contained in another vector (heavily based on the code from here).
my_contain <- function(values, x){
  tx <- table(x)
  tv <- table(values)
  z <- tv[names(tx)] - tx
  if(all(z >= 0 & !is.na(z))){
    paste(x, collapse = " ")
  }
}
#value would be the longer vector (from full_name_list)
#and x would be the shorter vector(from first_last_name_list)
Then, I tried to put this function within sapply() so that I can work with lists and that's where I got stuck. I can get it to see whether one vector is contained within a list of vectors, but I'm not sure how to check all the vectors in one list and see if it is contained within any of the vectors from a second list.
#testing with the first vector from first_last_names_list.
#Need to make it run through all the vectors from first_last_names_list.
sapply(1:length(full_names_list),
function(i) any(my_contain(full_names_list[[i]],
first_last_names_list[[1]]) ==
paste(first_last_names_list[[1]], collapse = " ")))
#[1] TRUE FALSE FALSE FALSE
Lastly- although it might be too much to ask in one question- if anyone could give me any pointers on how to incorporate agrep() for fuzzy matching to account for typos in the names, that would be great! If not, that's okay too, since I want to get at least the matching part right first.
Since you are dealing with lists, it is easier to collapse them into character vectors so that you can work with regular expressions. If you also sort each name vector into ascending order, the names line up and you can easily match them:
lst=sapply(first_last_names_list,function(x)paste0(sort(x),collapse=" "))
lst1=gsub("\\s|$",".*",lst)
lst2=sapply(full_names_list,function(x)paste(sort(x),collapse=" "))
(lst3 = Vectorize(grep)(lst1,list(lst2),value=T,ignore.case=T))
boy.*boy.* bob.*orengo.* kalonzo.*musyoka.* anami.*lisamula.*
"boy boy juma" "bob james orengo" "kalonzo musyoka stephen" "anami lisamula silverse"
Now if you want to link first_name_last_name_list and full_name_list then:
setNames(full_names_list[match(lst3, lst2)],
         sapply(first_last_names_list[grep(paste0(names(lst3), collapse = "|"), lst1)],
                paste, collapse = " "))
$`boy boy`
[1] "boy" "juma" "boy"
$`bob orengo`
[1] "james" "bob" "orengo"
$`kalonzo musyoka`
[1] "stephen" "kalonzo" "musyoka"
$`anami lisamula`
[1] "lisamula" "silverse" "anami"
where the names come from first_last_names_list and the elements from full_names_list. In general it is easier to work with character vectors than with lists.
Edit: I've modified the solution to satisfy the constraint that a repeated name such as 'John John' should not match against 'John Smith'.
apply(sapply(first_last_names_list, unlist), 2, function(x){
any(sapply(full_names_list, function(y) sum(unlist(y) %in% x) >= length(x)))
})
This solution still uses %in% and the apply functions, but it now does a kind of reverse search: for every element of first_last_names_list, it counts how many words of each name in full_names_list are matched. If this count is greater than or equal to the number of words in the first_last_names_list item under consideration (always 2 words in your examples, but the code works for any number), it returns TRUE. This logical array is then aggregated with any() to pass back a single vector showing whether each first/last name matches any full name.
So, for example, 'John John' would not be matched to 'John Smith Random', as only 1 of the 3 words in 'John Smith Random' is matched. However, it would be matched to 'John Adam John', as 2 of the 3 words in 'John Adam John' are matched, and 2 equals the length of 'John John'. It would also match 'John John John John John', as 5 of the 5 words match, which is greater than 2.
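The 'John' names below are the hypothetical examples from the text; this sketch shows the word-count test in isolation:

```r
x <- c("John", "John")  # repeated first/last name
candidates <- list(c("John", "Smith", "Random"),
                   c("John", "Adam", "John"))

# Count how many words of each full name appear in x,
# and require at least length(x) matches
sapply(candidates, function(y) sum(y %in% x) >= length(x))
# [1] FALSE  TRUE
```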
Instead of my_contain, try
x %in% values
Maybe also unlist and work with data frames? Not sure if you considered it--might make things easier:
# unlist to vectors
fl <- unlist(first_last_names_list)
fn <- unlist(full_names_list)
# grab individual names and convert to dfs;
# assumptions: first_last_names_list only contains 2-element vectors
# full_names_list only contains 3-element vectors
first_last_df <- data.frame(first_fl=fl[c(T, F)],last_fl=fl[c(F, T)])
full_name_df <- data.frame(first_fn=fn[c(T,F,F)],mid_fn=fn[c(F,T,F)],last_fn=fn[c(F,F,T)])
Or you could do this:
first_last_names_list <- list(c("boy", "boy"),
c("bob", "orengo"),
c("kalonzo", "musyoka"),
c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
c("stephen", "kalonzo", "musyoka"),
c("james", "bob", "orengo"),
c("lisamula", "silverse", "anami"),
c("musyoka", "jeremy", "kalonzo")) # added just to test
# create copies of full_names_list without middle name;
# one list with matching name order, one with inverted order
full_names_short <- lapply(full_names_list,function(x){x[c(1,3)]})
full_names_inv <- lapply(full_names_list,function(x){x[c(3,1)]})
# check if names in full_names_list match either
full_names_list[full_names_short %in% first_last_names_list | full_names_inv %in% first_last_names_list]
In this case %in% does exactly what you want it to do, it checks if the complete name vector matches.
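This works because %in% (via match()) converts list elements to character strings before comparing, so two list elements are only equal when the whole vectors are identical. A small sketch:

```r
shortened <- list(c("boy", "boy"), c("stephen", "musyoka"))
wanted    <- list(c("boy", "boy"), c("bob", "orengo"))

# Each element of `shortened` is compared as a whole vector
shortened %in% wanted
# [1]  TRUE FALSE
```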
I want to look for individual character strings [Sequences] in another vector [Database] that has 500+ entries.
For example:
Sequences<-c("AzzY","BbDe")
Database<-c("TTUAzzY","aaa","DBbDe","CAzzY")
Ideally, the code would iterate through each row of [Database] to find if there is a match with one of the [Sequences]. From the example above, there would be 2 matches for "AzzY" and 1 match for "BbDe". I would like this count to be added in a new column of [Sequences].
Many thanks.
require("dplyr")
Sequences=c("AzzY","BbDe")
Database=c("TTUAzzY","aaa","DBbDe","CAzzY")
df=as.data.frame(sapply(Sequences, function(x) grepl(x,Database)))
stats=df %>% summarise_each(funs(sum))
cbind(Sequences,as.numeric(stats))
Sequences=c("AzzY","BbDe")
Database=c("TTUAzzY","aaa","DBbDe","CAzzY")
sapply(Sequences, function(x) length(grep(x, Database)))
# AzzY BbDe
# 2 1
Another idea would be to "flatten" the Database vector and search in that.
Here's that idea in practice with the efficient "stringi" package:
library(stringi)
stri_count_fixed(stri_flatten(Database), Sequences)
# [1] 2 1
You can pretty-up the output and add names if you want with setNames:
setNames(stri_count_fixed(stri_flatten(Database), Sequences), Sequences)
# AzzY BbDe
# 2 1
I have a data frame in R with a column called created_at that holds text which I want to parse into a datetime. Here is a quick preview:
head(pushes)
created_at repo.url repository.url
1 2013-06-17T00:14:04Z https://github.com/Mindful/blog
2 2013-07-31T21:08:15Z https://github.com/leapmotion/js.leapmotion.com
3 2012-11-04T07:08:15Z https://github.com/jplusui/jplusui
4 2012-06-21T08:16:22Z https://github.com/LStuker/puppet-rbenv
5 2013-03-10T09:15:51Z https://github.com/Fchaubard/CS108FinalProject
6 2013-10-04T11:34:11Z https://github.com/cmmurray/soccer
actor.login payload.actor actor_attributes.login
1 Mindful
2 joshbuddy
3 xuld
4 LStuker
5 ststanko
6 cmmurray
I wrote an expression which works OK with some test data:
xts::.parseISO8601("2012-06-17T00:14:04",tz="UTC")$first.time returns a proper POSIXct date.
But when I apply it to a column with this instruction:
pushes$created_at <- xts::.parseISO8601(substr(pushes$created_at,1,nchar(pushes$created_at)-1),tz="UTC")$first.time
every row in the data frame gets the same duplicated date, 2013-06-17 00:14:04 UTC,
as if the function ran only once for the first row and the result was then duplicated across the rest of the rows :( Can you please help me apply it properly, row by row, to the created_at column?
Thanks.
The first argument to .parseISO8601 is supposed to be a character string, not a vector. You need to use sapply (or equivalent) to loop over your vector.
created_at <-
c("2013-06-17T00:14:04Z", "2013-07-31T21:08:15Z", "2012-11-04T07:08:15Z",
"2012-06-21T08:16:22Z", "2013-03-10T09:15:51Z", "2013-10-04T11:34:11Z")
# Only parses first element
.parseISO8601(substr(created_at,1,nchar(created_at)-1),tz="UTC")$first.time
# [1] "2013-06-17 00:14:04 UTC"
firstParseISO8601 <- function(x) .parseISO8601(x,tz="UTC")$first.time
# parse all elements
datetimes <- sapply(sub("Z$","",created_at), firstParseISO8601, USE.NAMES=FALSE)
# note that "simplifying" the output strips the POSIXct class, so we re-add it
datetimes <- .POSIXct(datetimes, tz="UTC")
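As a side note, for this fixed timestamp layout there is a vectorised base-R alternative that needs neither xts nor sapply: as.POSIXct() with an explicit format string, where the trailing Z is matched literally:

```r
created_at <- c("2013-06-17T00:14:04Z", "2013-07-31T21:08:15Z")
as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
# [1] "2013-06-17 00:14:04 UTC" "2013-07-31 21:08:15 UTC"
```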
I would like to address the rows of a data frame by a string name, and the table will be built sequentially. I want to do something like
> mytab <- data.frame(city=c("tokyo","delhi","lima"),price=c(9,8,7),row.names=1)
> mytab
price
tokyo 9
delhi 8
lima 7
> # I can add a new row
> mytab["london",] = 8.5
I now need to check whether a row name already exists.
> mytab["ny",]
[1] NA
Is there anything better that I can do other than
> if (is.na(mytab["ny",])) { mytab["ny",]=9;}
since a NA may possibly arise otherwise?
Something like this
if (!('ny' %in% row.names(mytab))) {mytab['ny',]=9}
might do the trick.
There are plenty of ways to do this. One of the easiest is to use the any() function like this:
# Returns true if any of the row names is 'lima', false otherwise.
any(row.names(mytab) == 'lima')
Since this returns a boolean, you can branch conditionals from it as you please.
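For example, combining it with the assignment from the question (9 is just a placeholder price):

```r
mytab <- data.frame(city = c("tokyo", "delhi", "lima"),
                    price = c(9, 8, 7), row.names = 1)

# Only add the row if "ny" is not already a row name
if (!any(row.names(mytab) == "ny")) {
  mytab["ny", ] <- 9
}

mytab["ny", ]
# [1] 9
```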
Here's a slightly different approach if you want to check several cities in one go. That could help speed things up...
mytab <- data.frame(city=c("tokyo","delhi","lima"),price=c(9,8,7),row.names=1)
# Check several cities in one go:
newcities <- c('foo', 'delhi', 'bar')
# Returns the missing cities:
setdiff(newcities, row.names(mytab))
#[1] "foo" "bar"
# Returns the existing cities:
intersect(newcities, row.names(mytab))
#[1] "delhi"
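Building on that, the cities that setdiff() reports as missing can be added in one assignment (9 here is just a placeholder price):

```r
mytab <- data.frame(city = c("tokyo", "delhi", "lima"),
                    price = c(9, 8, 7), row.names = 1)
newcities <- c("foo", "delhi", "bar")

missing_cities <- setdiff(newcities, row.names(mytab))
mytab[missing_cities, "price"] <- 9
# rows "foo" and "bar" have been appended; "delhi" is untouched
```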