I want to "subset" this dataframe and remove the second row using the rowname
myDataFrame <- as.data.frame(rnorm(5))
rownames(myDataFrame)
#"1" "2" "3" "4" "5"
myDataFrame[-2,]
# 0.2706859 0.9708845 0.7559821 -0.2063368
I want to be able to get the results above, but in a data frame form (with the original row names). I looked around and it seems the way to select by rowname is to use the which function, but I'm not sure how it would work in this context.
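By default, subsetting a one-column data frame with [ simplifies the result to a vector. Add the argument drop = FALSE to keep it as a data frame, row names included.
> myDataFrame[-2, , drop = FALSE]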
rnorm(5)
1 1.9602780
3 0.1078827
4 -0.8517422
5 -0.8300695
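To drop the row by its name rather than by its position, as the question asks, one option is a logical comparison against rownames(). This is a minimal sketch; the column name x and the seed are chosen here only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
myDataFrame <- data.frame(x = rnorm(5))

# Keep every row whose name is not "2"; drop = FALSE keeps the data frame
result <- myDataFrame[rownames(myDataFrame) != "2", , drop = FALSE]

# Equivalent form that extends to several row names at once
result2 <- myDataFrame[!(rownames(myDataFrame) %in% c("2")), , drop = FALSE]

print(rownames(result))
```

Unlike myDataFrame[-2, ], this keeps working if the rows have been reordered, because it matches on the name and not on the position.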
I'm relatively new to R and have a question about data processing. The dataset is too big for a for loop, so I want to write a vectorized function that is faster, but I don't know how. The data is about movies and user ratings and is formatted like this:
1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
The 1: and 2: represent movies, while the other lines hold a user id, the user's rating and the date of the rating for that movie (in that order from left to right, separated by commas). I want to format the data as an edge list, like this:
Movie | User
1: | 5
1: | 1
1: | 3
2: | 2
2: | 3
2: | 5
I wrote the code below to do this. For every row, it checks whether the row is a movie id (containing ':') or user data. It then combines the movie id and user id as two columns and row-binds them to a new data frame. It only binds users who rated the movie 5 out of 5.
el <- data.frame(matrix(ncol = 2, nrow = 0))
for (i in 1:nrow(data))
{
  if (grepl(':', data[i,]))
  {
    # Movie line: remember the current movie id
    mid <- data[i,]
  } else if (grepl(',', data[i,]))
  {
    # Rating line: keep it only when the rating is 5
    if (grepl(',5,', data[i,]))
    {
      uid <- unlist(strsplit(data[i,], ','))[1]
      add <- c(mid, uid)
      el <- rbind(el, add)
    }
  }
}
However, I have about 100 million entries, and the for loop runs throughout the night without being able to complete. Is there a way to speed this up? I read about vectorization, but I can't figure out how to vectorize this function. Any help?
You can do this with a few regular expressions, for which I'll use the stringr package, as well as na.locf from the zoo package. (You'll have to install stringr and zoo first).
First we'll set up your data, which it sounds like is in a one-column data frame:
data <- read.table(textConnection("1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
"))
Then follow these steps (explanation in comments).
# Pull out the column as a character vector for simplicity
lines <- data[[1]]
library(stringr)
# Figure out which lines represent movie IDs, and extract IDs
movie_ids <- str_match(lines, "(\\d+):")[, 2]
# Fill the last observation carried forward (locf), to find out
# the most recent non-NA value
library(zoo)
movie_ids_filled <- na.locf(movie_ids)
# Extract the user IDs
user_ids <- str_match(lines, "(\\d+),")[, 2]
# For each line that has a user ID, match it to the movie ID
result <- cbind(movie_ids_filled[!is.na(user_ids)],
user_ids[!is.na(user_ids)])
This gets the result
[,1] [,2]
[1,] "1" "5"
[2,] "1" "1"
[3,] "1" "3"
[4,] "2" "2"
[5,] "2" "3"
[6,] "2" "5"
The most important part of this code is the use of regular expressions, particularly the capturing groups in parentheses in "(\\d+):" and "(\\d+),". For more on using str_match with regular expressions, do check out this guide.
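The question also asks to keep only the 5-out-of-5 ratings, which the steps above leave out. The following is a base R sketch of the same idea that adds that filter, using cumsum() in place of na.locf. It assumes the first line is always a movie header and that every rating line has exactly three comma-separated fields; the variable names are illustrative.

```r
lines <- c("1:", "5,3,2005-09-06", "1,5,2005-05-13", "3,4,2005-10-19",
           "2:", "2,4,2005-12-26", "3,3,2004-05-03", "5,3,2005-11-17")

# TRUE for movie header lines such as "1:"
is_movie <- grepl(":$", lines)

# cumsum() counts the headers seen so far, so it indexes the movie each line belongs to
movie_of_line <- sub(":$", "", lines[is_movie])[cumsum(is_movie)]

# Pull the user id (first field) and rating (second field) from the rating lines
rating_lines <- lines[!is_movie]
user   <- sub(",.*$", "", rating_lines)
rating <- sub("^[^,]*,([^,]*),.*$", "\\1", rating_lines)

# Full edge list, then the subset with a rating of 5
el  <- data.frame(movie = movie_of_line[!is_movie], user = user,
                  stringsAsFactors = FALSE)
el5 <- el[rating == "5", , drop = FALSE]

print(el5)
```

Every step works on the whole vector at once, so nothing grows inside a loop, and repeated rbind calls inside a loop are usually what makes the original so slow.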
Hello, I would like to select rows from a data frame in the form of a list. Here is my data frame:
df2 <- data.frame("user_id" = 1:2, "username" = c(215,154), "password" = c("John4","Dora4"))
With this data frame I can only select one column at a time to view its values as a list, which I did with this code:
df2[["user_id"]]
output is
[1] 1 2
But when I try this with more columns, I am told it's out of bounds. What is the problem here?
df2[["user_id", "username"]]
How can I resolve this and get the rows as a list?
If I understood your question correctly, you need to familiarize yourself with subsetting in R. These are ways to select multiple columns in R:
df2[,c('user_id', 'username')]
or
df2[,1:2]
If you want to return all columns as a list, you can use something like this:
lapply(1:ncol(df2), function(x) df2[,x])
The format is df2['rows','columns'], so you should use:
df2[,c("user_id", "username")]
To get them 'in form of list', do:
as.list(df2[,c("user_id", "username")])
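The double bracket [[ notation is used to select a single unnamed element (in this case a single unnamed column, since data frames are essentially lists of column data). With two indices, [[ treats the first as a row and the second as a column and extracts a single cell, so it cannot return two columns.
See this answer for more on double vs single bracket notation: https://stackoverflow.com/a/1169495/8444966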
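The difference between the two bracket forms can be sketched on the data frame from the question. The result variable names are illustrative.

```r
df2 <- data.frame(user_id = 1:2, username = c(215, 154),
                  password = c("John4", "Dora4"))

# [[ with one index returns the column itself, as a plain vector
one_column <- df2[["user_id"]]

# [ with a vector of names returns a data frame with those columns
two_columns <- df2[c("user_id", "username")]

# as.list() turns that data frame into a named list of columns
as_list <- as.list(two_columns)

# [[ with two indices picks one cell: row 2 of the username column
single_cell <- df2[[2, "username"]]

str(as_list)
```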
This should give you a list with one element per row (there's got to be an existing answer for this somewhere).
row_list<- as.list(as.data.frame(t(df2[c("user_id", "username")])))
#$V1
#[1] 1 215
#$V2
#[1] 2 154
If you want to keep the row names and column names:
df2_subset <- df2[c("user_id", "username")]
setNames(split(df2_subset, seq(nrow(df2_subset))), rownames(df2_subset))
#$`1`
# user_id username
#1 1 215
#$`2`
# user_id username
#2 2 154
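Note that t() converts the data frame to a matrix, so every value is coerced to one common type; with the password column included, the numbers would become character strings. A minimal sketch that builds one list per row and keeps each column's type, assuming the data frame is small enough for a row-wise lapply():

```r
df2 <- data.frame(user_id = 1:2, username = c(215, 154),
                  password = c("John4", "Dora4"),
                  stringsAsFactors = FALSE)

# One named list per row; as.list() on a one-row data frame keeps column types
rows <- lapply(seq_len(nrow(df2)), function(i) as.list(df2[i, , drop = FALSE]))
names(rows) <- rownames(df2)

str(rows[["2"]])
```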
I have a character string looking like this:
string <- c("1","2","3","","5","6","")
I would like to replace the gaps by the previous value, obtaining a string similar to this:
string <- c("1","2","3","3","5","6","6")
I have adjusted this solution (Replace NA with previous and next rows mean in R) and I do get the correct result:
string <- as.data.frame(string)
ind <- which(string == "")
string$string[ind] <- sapply(ind, function(i) with(string, string[i-1]))
This way is however quite cumbersome and there must be an easier way that does not require me to transform the string to a data frame first. Thanks for your help!
We can use na.locf from zoo after changing the blank ("") to NA so that the NA values get replaced by the non-NA adjacent previous values
library(zoo)
na.locf(replace(string, string =="", NA))
#[1] "1" "2" "3" "3" "5" "6" "6"
If there is at most one blank between the elements, create an index as in the OP's post and replace each blank with the element one position before it
i1 <- which(string == "")
string[i1] <- string[i1-1]
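The index approach above picks up another blank when two blanks follow each other. A base R sketch that handles runs of blanks without zoo uses cummax() to track the position of the last non-blank element. It assumes the first element is not blank; the example vector is modified from the question to include two consecutive blanks.

```r
string <- c("1", "2", "3", "", "", "6", "")

# Position of each non-blank element, 0 for blanks
filled <- string != ""
positions <- seq_along(string) * filled

# cummax() carries the last non-blank position forward over the blanks
last_filled <- cummax(positions)

result <- string[last_filled]
print(result)
```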
I have a data set with IDs, for example:
19878, 19659, 19855, 18658, 18996, 18002
I want to filter the data based on a digit in the ID. For example, I want to keep the rows whose ID has a 9 in the second position, i.e. 19878, 19659, 19855 etc.
Try this:
data <- c(19878, 19659, 19855, 18658, 18996, 18002)
Extract your 2nd position of each ID in "data" with substr() :
substr(data,2,2)
[1] "9" "9" "9" "8" "8" "8"
Find out with grepl() which IDs contain a 9 at the 2nd position :
grepl(9,substr(data,2,2))
[1] TRUE TRUE TRUE FALSE FALSE FALSE
Cross your result with your "data" object :
data[grepl(9,substr(data,2,2))]
[1] 19878 19659 19855
Edit :
Faster solution by Gregor (removing grepl step):
data[substr(data,2,2) == "9"]
I did that by splitting the IDs at 19000, which separates the IDs with 8 in the second position from those with 9.
exam8=exam[exam[,1]<19000,] ##### selection of 8
exam9=exam[exam[,1]>=19000,] ##### selection of 9
But the answers posted here are useful when specific IDs are needed from arbitrary data.
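The substr() comparison also works directly on a data frame such as the exam object in the comment above. In this sketch the column names id and score, and the score values, are made up for illustration.

```r
# Hypothetical data frame; only the id values come from the question
exam <- data.frame(id = c(19878, 19659, 19855, 18658, 18996, 18002),
                   score = c(71, 64, 88, 59, 90, 77))

# Second character of each ID, compared as text
second_digit <- substr(as.character(exam$id), 2, 2)

exam9 <- exam[second_digit == "9", , drop = FALSE]
exam8 <- exam[second_digit == "8", , drop = FALSE]

print(exam9)
```

Unlike a numeric cut-off, this does not depend on all IDs having the same number of digits or the same first digit.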
Here is sample data:
main.data <- c("id","num","open","close","char","gene","valid")
data.step.1 <- list(id="12",num="00",open="01-01-2015",char="yes",gene="1234",valid="NA")
match.step.1 <- unlist(data.step.1)
The main.data are the column names of all possible column data.
I have a loop that streams data step by step, and any step may be missing some columns (list names).
I would like to match each step (data.step.n) against the master column names (main.data).
Desired output:
id num open close char gene valid
"12" "00" "01-01-2015" "" "yes" "1234" "NA"
How can I unlist the data and match it against the names so that a missing entry, like close in this case, is filled with an empty string?
Try
v1 <- setNames(rep('', length(main.data)), main.data)
v1[main.data %in% names(match.step.1)] <- match.step.1
Or use match, which also works when the names arrive in a different order than in main.data
v1[match(names(match.step.1), main.data)] <- match.step.1
Or just use [
v2 <- setNames(match.step.1[main.data], main.data)
v2[is.na(v2)] <- ''
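Since the steps arrive in a loop, the last variant can be wrapped in a small helper and applied to each step. This is a sketch: align_step is a hypothetical name, step2 is made-up data, and each step is assumed to be a flat list of character values.

```r
main.data <- c("id", "num", "open", "close", "char", "gene", "valid")

# Hypothetical helper: reorder one step to the master columns, "" where missing
align_step <- function(step, cols) {
  values <- unlist(step)[cols]      # NA wherever a column is absent
  values[is.na(values)] <- ""
  setNames(values, cols)
}

step1 <- list(id = "12", num = "00", open = "01-01-2015",
              char = "yes", gene = "1234", valid = "NA")
step2 <- list(id = "13", close = "02-01-2015")  # made-up second step

# One row per step, one column per master name
aligned <- do.call(rbind, lapply(list(step1, step2), align_step, cols = main.data))
print(aligned)
```

The string "NA" in valid is left alone, because is.na() only matches real missing values.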