Filtering IDs based on a condition in R

I have a data set with IDs, for example:
19878, 19659, 19855, 18658, 18996, 18002
I want to filter the IDs based on a digit at a given position in the ID. For example, I want to keep the IDs that have a 9 in the second position, i.e. 19878, 19659, 19855, etc.

Try this:
data <- c(19878, 19659, 19855, 18658, 18996, 18002)
Extract the 2nd character of each ID in data with substr():
substr(data, 2, 2)
[1] "9" "9" "9" "8" "8" "8"
Find out with grepl() which IDs have a 9 at the 2nd position:
grepl(9, substr(data, 2, 2))
[1] TRUE TRUE TRUE FALSE FALSE FALSE
Cross-reference the result with your data object:
data[grepl(9, substr(data, 2, 2))]
[1] 19878 19659 19855
Edit:
A faster solution from Gregor (removing the grepl step):
data[substr(data, 2, 2) == "9"]
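If the IDs are a column in a data frame rather than a plain vector, the same test works as a row filter. A minimal sketch, assuming a hypothetical data frame df with an id column:
# Hypothetical data frame with an 'id' column
df <- data.frame(id = c(19878, 19659, 19855, 18658, 18996, 18002),
                 value = 1:6)
# Keep only the rows whose ID has a 9 in the second position
df[substr(df$id, 2, 2) == "9", ]
#      id value
# 1 19878     1
# 2 19659     2
# 3 19855     3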

I did it by comparing the IDs against 19000 to separate the ones with a 9 in the second position from the ones with an 8:
exam8 <- exam[exam[, 1] < 19000, ]  ##### selection of 8
exam9 <- exam[exam[, 1] > 19000, ]  ##### selection of 9
But the answers posted here are useful if specific IDs are needed from random data as a whole.

Related

How to vectorize a for loop in R for a large dataset

I'm relatively new to R and I have a question about data processing. The main issue is that the dataset is too big, and I want to write a vectorized function that's faster than a for loop, but I don't know how. The data is about movies and user ratings and is formatted like this:
1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
The 1: and 2: represent movies, while the other lines represent a user id, user rating and date of rating for that movie (in that order from left to right, separated by commas). I want to format the data as an edge list, like this:
Movie | User
1: | 5
1: | 1
1: | 3
2: | 2
2: | 3
2: | 5
I wrote the code below to do this. Basically, for every row, it checks whether the row is a movie id (containing ':') or user data. It then combines the movie id and user id as two columns for every movie and user, and row-binds that to a new data frame. It also only keeps users who rated a movie 5 out of 5.
el <- data.frame(matrix(ncol = 2, nrow = 0))
for (i in 1:nrow(data)) {
  if (grepl(':', data[i, ])) {
    # line is a movie id like "1:"
    mid <- data[i, ]
  } else if (grepl(',', data[i, ])) {
    # line is user data: user id, rating, date
    if (grepl(',5,', data[i, ])) {
      uid <- unlist(strsplit(data[i, ], ','))[1]
      add <- c(mid, uid)
      el <- rbind(el, add)
    }
  }
}
However, I have about 100 million entries, and the for loop has been running all night without completing. Is there a way to speed this up? I read about vectorization, but I can't figure out how to vectorize this function. Any help?
You can do this with a few regular expressions, for which I'll use the stringr package, as well as na.locf from the zoo package. (You'll have to install stringr and zoo first).
First we'll set up your data, which it sounds like is in a one-column data frame:
data <- read.table(textConnection("1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
"))
You can then follow these steps (explanation in the comments).
# Pull out the column as a character vector for simplicity
lines <- data[[1]]
library(stringr)
# Figure out which lines represent movie IDs, and extract IDs
movie_ids <- str_match(lines, "(\\d+):")[, 2]
# Fill in NAs with the last observation carried forward (locf),
# so every line gets the most recent movie ID
library(zoo)
movie_ids_filled <- na.locf(movie_ids)
# Extract the user IDs
user_ids <- str_match(lines, "(\\d+),")[, 2]
# For each line that has a user ID, match it to the movie ID
result <- cbind(movie_ids_filled[!is.na(user_ids)],
user_ids[!is.na(user_ids)])
This gets the result
[,1] [,2]
[1,] "1" "5"
[2,] "1" "1"
[3,] "1" "3"
[4,] "2" "2"
[5,] "2" "3"
[6,] "2" "5"
The most important part of this code is the use of regular expressions, particularly the capturing groups in parentheses in "(\\d+):" and "(\\d+),". For more on using str_match with regular expressions, do check out this guide.
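Note that the original post also wanted to keep only 5-star ratings; that can be handled the same way by capturing the rating as a second group. A minimal sketch building on the objects above (the two-group pattern and the movie/user column names are my own additions, not part of the answer):
# Capture user id and rating in one pass: user lines look like "user,rating,date"
parts <- str_match(lines, "^(\\d+),(\\d+),")
user_ids <- parts[, 2]
ratings <- as.integer(parts[, 3])
# Keep only user lines whose rating is 5, paired with their movie ID
keep <- !is.na(user_ids) & ratings == 5
result <- data.frame(movie = movie_ids_filled[keep],
                     user = user_ids[keep],
                     stringsAsFactors = FALSE)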

List Rows from column selection

Hello, I would like to select rows in the form of a list from a data frame. Here is my data frame:
df2 <- data.frame("user_id" = 1:2, "username" = c(215,154), "password" = c("John4","Dora4"))
With this data frame I can only select one column to view its rows as a list, which I did with this code:
df2[["user_id"]]
output is
[1] 1 2
but when I try this with more than one column I am told it's out of bounds. What is the problem here?
df2[["user_id", "username"]]
How can I resolve this and get the rows as a list?
If I understood your question correctly, you need to familiarize yourself with subsetting in R. These are ways to select multiple columns in R:
df2[,c('user_id', 'username')]
or
df2[,1:2]
If you want to return all columns as a list, you can use something like this:
lapply(1:ncol(df2), function(x) df2[,x])
The format is df2['rows','columns'], so you should use:
df2[,c("user_id", "username")]
To get them 'in form of list', do:
as.list(df2[,c("user_id", "username")])
The double-bracket [[ notation is used to extract a single element (in this case a single column, since data frames are essentially lists of columns).
See this answer for more on double vs single bracket notation: https://stackoverflow.com/a/1169495/8444966
This should give you the rows as a list (there's bound to be an existing answer for this somewhere).
row_list<- as.list(as.data.frame(t(df2[c("user_id", "username")])))
#$V1
#[1] 1 215
#$V2
#[1] 2 154
If you want to keep names of the rows.
df2_subset <- df2[c("user_id", "username")]
setNames(split(df2_subset, seq(nrow(df2_subset))), rownames(df2_subset))
#$`1`
# user_id username
#1 1 215
#$`2`
# user_id username
#2 2 154
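If you're open to using the purrr package, transpose() flips a data frame (a list of columns) into a list of rows, each a named list. A minimal sketch (this is my addition, not part of the answers above, and purrr must be installed):
library(purrr)
# One list element per row; each element is a named list of the selected columns
transpose(df2[c("user_id", "username")])
# [[1]]
# [[1]]$user_id
# [1] 1
#
# [[1]]$username
# [1] 215
# ... (second row analogous)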

How to remove constant parts of a string in R

I would like to remove constant (shared) parts of a string automatically and retain the variable parts.
E.g. I have a column with the following:
D20181116_Basel-Take1_digital
D20181116_Basel-Take2_digital
D20181116_Basel-Take3_digital
D20181116_Basel-Take4_digital
D20181116_Basel-Take5_digital
D20181116_Basel-Take5a_digital
How can I get to the output below automatically for any similar column (here that means removing "D20181116_Basel-Take" and "_digital")? The code should find the constant parts itself and remove them:
1
2
3
4
5
5a
I hope this is clear. Thank you very much.
You can do it with a regex: it removes everything up to and including 'Take', and everything from the following underscore onwards:
vec<- c("D20181116_Basel-Take1_digital",
"D20181116_Basel-Take2_digital",
"D20181116_Basel-Take3_digital",
"D20181116_Basel-Take4_digital",
"D20181116_Basel-Take5_digital",
"D20181116_Basel-Take5a_digital")
sub(".*?Take(.*?)_.*", "\\1", vec)
[1] "1" "2" "3" "4" "5" "5a"
With gsub(), assuming you have a data frame df and want to change the column df$column:
df$column <- gsub("^D20181116_Basel-Take","",df$column)
df$column <- gsub("_digital$","",df$column)
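If you really need the constant parts detected automatically, as the question asks, one approach is to compute the longest common prefix and suffix across the whole vector and strip them by position. A minimal sketch; strip_common and lcp are hypothetical helper names of my own:
# Strip the longest common prefix and suffix shared by all strings
strip_common <- function(x) {
  # longest common prefix of a character vector
  lcp <- function(s) {
    chars <- strsplit(s, "")
    n <- min(lengths(chars))
    i <- 0
    while (i < n && length(unique(vapply(chars, `[`, "", i + 1))) == 1) i <- i + 1
    substr(s[1], 1, i)
  }
  # common suffix = reversed common prefix of the reversed strings
  rev_str <- function(s) vapply(lapply(strsplit(s, ""), rev), paste, "", collapse = "")
  prefix <- lcp(x)
  suffix <- rev_str(lcp(rev_str(x)))
  # drop prefix and suffix by character position
  substr(x, nchar(prefix) + 1, nchar(x) - nchar(suffix))
}
strip_common(vec)
# [1] "1"  "2"  "3"  "4"  "5"  "5a"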

R - filter a list based on names which start with a numerical value

I have the following list in R:
x <- list("a"="m","a2"="test","001"="test2","002"="test3")
$a
[1] "m"
$a2
[1] "test"
$`001`
[1] "test2"
$`002`
[1] "test3"
I want to filter this list so that it returns only the items which begin with a number, i.e. it would return:
x$`001` and x$`002`
Peter hasn't picked it up yet, so I'll post my comment as an answer. We can use the regex pattern "^[0-9]" to find strings that start with a number. Applying that to the names of your list:
x[grepl("^[0-9]", names(x))]
# $`001`
# [1] "test2"
#
# $`002`
# [1] "test3"
Not exactly sure what you mean here, but here are two possibilities that take advantage of the fact that you can filter a list by supplying a vector within single brackets.
If what you want is elements of the list that have numbers in them:
x[sapply(x, function(i){grepl("[0-9]", i)})]
If what you want is elements of the list that have a name that can be interpreted as a number:
x[!is.na(as.numeric(names(x)))]
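One caveat: as.numeric() throws an "NAs introduced by coercion" warning for the non-numeric names. If that is a concern, the conversion can be wrapped in suppressWarnings() (this wrapping is my addition, not part of the answer above):
x[!is.na(suppressWarnings(as.numeric(names(x))))]
# $`001`
# [1] "test2"
#
# $`002`
# [1] "test3"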

Delete a row in a dataframe and get a dataframe back

I want to "subset" this dataframe and remove the second row using the rowname
myDataFrame <- as.data.frame(rnorm(5))
rownames(myDataFrame)
#"1" "2" "3" "4" "5"
myDataFrame[-2,]
# 0.2706859 0.9708845 0.7559821 -0.2063368
I want to be able to get the results above, but in data frame form (with the original row names). I looked around and it seems the way to select by row name is to use the which() function, but I'm not sure how it would work in this context.
You can add an argument drop = FALSE.
> myDataFrame[-2, , drop = FALSE]
rnorm(5)
1 1.9602780
3 0.1078827
4 -0.8517422
5 -0.8300695
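Since the question mentions which(), the same row can also be dropped by matching its row name rather than its position. A minimal sketch using the example above (note that if the name is not found, which() returns integer(0) and this would return zero rows):
# Drop the row whose name is "2", keeping the data frame structure
myDataFrame[-which(rownames(myDataFrame) == "2"), , drop = FALSE]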
