I wrote the following code to extract multiple datasets out of one large dataset based on the column Time.
for(i in 1:nrow(position)) {
  assign(paste("position.", i, sep=""),
         subset(dataset, Time >= position[i,1] & Time <= position[i,2]))
}
(position is a list which contains the starttime[,1] and stoptime[,2])
The outputs are subsets of my original dataset and look like:
position.1
position.2
position.3
....
Is it possible to add an extra column to each of the new datasets (position.1, position.2, ...) which identifies them by a number?
eg: position.1 has an extra column with value 1, position.2 has an extra column with value 2, and so on.
I need those numbers to identify the datasets (position.1, position.2, ...) after I rbind them into one dataset again in a last step.
Since you don't provide example data, this is untested, but should work for you:
dflist <-
lapply(1:nrow(position), function(x) {
  # subset the rows in the x-th time window and tag them with the window number
  within(dataset[dataset$Time >= position[x,1] & dataset$Time <= position[x,2],], val <- x)
})
do.call(rbind, dflist)
Basically, you never want to take the strategy you propose of assigning multiple numbered objects to the global environment. It is much easier to store all of the subsets in a list and then bind them back together using do.call(rbind, dflist). This is more efficient, produces less clutter in your workspace, and is a more "functional" style of programming.
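To make the list-then-rbind idea concrete, here is a minimal sketch on made-up data. The column names (value, start, stop, id) and the toy values are assumptions for illustration only, not taken from the question:

```r
# Toy data; names and values are invented for illustration
dataset  <- data.frame(Time = 1:10, value = letters[1:10])
position <- data.frame(start = c(1, 6), stop = c(3, 8))

dflist <- lapply(seq_len(nrow(position)), function(i) {
  sub <- dataset[dataset$Time >= position[i, 1] & dataset$Time <= position[i, 2], ]
  sub$id <- rep(i, nrow(sub))  # rep() also copes with a window that matches no rows
  sub
})

# One data frame again, with id telling the windows apart
combined <- do.call(rbind, dflist)
combined
```

The id column plays the role of the position.1, position.2, ... suffixes, so no numbered objects are needed in the workspace.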
In addition to Thomas's recommendation to avoid side effects, you might want to take advantage of existing packages that detect overlaps. The IRanges package in Bioconductor can detect overlaps between one set of ranges (position) and another set of ranges or positions (dataset$Time). This gets you the matches between the time points and the ranges:
r <- IRanges(position[[1L]], position[[2L]])
hits <- findOverlaps(dataset$Time, r)
Now, you want to extract a subset of the dataset that overlaps each range in position. We can group the query (Time) indices by the subject (position) indices and extract a list from the dataset using that grouping:
dataset <- DataFrame(dataset)
l <- extractList(dataset, split(queryHits(hits), subjectHits(hits)))
To get the final answer, we need to combine the list elements row-wise, while adding a column that denotes their group membership:
ans <- stack(l)
First of all, I am using the ukpolice library in R and extracted data to a new data frame called crimes. Now I am running into a new problem: I am trying to extract certain data to a new empty data frame called df.shoplifting. If the category of the crime is equal to "shoplifting", it needs to add the id, month and street name to the new data frame. I need to use a loop and an if statement together.
EDIT:
Currently I have this working, but it lacks the if statement:
for (i in crimes$category) {
shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
names(shoplifting) <- c("ID", "Month", "Street_Name")
}
What I am trying to do:
for (i in crimes$category) {
if(crimes$category == "shoplifting"){
data1 <- subset(crimes, category == i, select = c(id, month, street_name))
}
}
It does run and creates the new data frame data1. But the data that it extracts is wrong and does not include only items with the shoplifting category.
I'll guess, and update if needed based on your question edits.
rbind works only on data.frame and matrix objects, not on vectors. If you want to extend a vector (N.B., that is not part of a frame or column/row of a matrix), you can merely extend it with c(somevec, newvals) ... but I think that this is not what you want here.
You are iterating through each value of crimes$category, but if one category matches, then you are appending all data within crimes. I suspect you mean to subset crimes when adding. We'll address this in the next bullet.
One cannot extend a single column of a multi-column frame in the absence of the others. A data.frame has a restriction that all columns must always have the same length, and extending one column defeats that. (And doing all columns immediately-sequentially does not satisfy that restriction.)
One way to work around this is to rbind a just-created data.frame:
# i = "shoplifting"
newframe <- subset(crimes, category == i, select = c(id, month, street_name))
names(newframe) <- c("ID", "Month", "Street_Name") # match df.shoplifting names
df.shoplifting <- rbind(df.shoplifting, newframe)
I don't have the data, but if crimes$category ever has repeats, you will re-add all of the same-category rows to df.shoplifting. This might be a problem with my assumptions, but is likely not what you really need.
If you really just need to do it once for a category, then do this without the need for a for loop:
df.shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
# optional
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
Iteratively adding rows to a frame is a bad idea: while it works okay for smaller datasets, as your data scales, the performance worsens. Why? Because each time you add rows to a data.frame, the entire frame is copied into a new object. It's generally better to form a list of frames and then concatenate them all later (c.f., https://stackoverflow.com/a/24376207/3358227).
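As a rough sketch of the list-of-frames approach, assuming a made-up crimes frame (the real ukpolice data is not available here, so every value below is invented):

```r
# Invented stand-in for the crimes data
crimes <- data.frame(
  id = 1:6,
  category = c("shoplifting", "drugs", "shoplifting",
               "burglary", "drugs", "shoplifting"),
  month = rep("2020-01", 6),
  street_name = paste("Street", 1:6),
  stringsAsFactors = FALSE
)

wanted <- c("shoplifting", "drugs")
pieces <- vector("list", length(wanted))  # pre-allocated list, one slot per category

for (k in seq_along(wanted)) {
  # store each subset in the list instead of rbind-ing inside the loop
  pieces[[k]] <- crimes[crimes$category == wanted[k], c("id", "month", "street_name")]
}

# a single concatenation at the end
result <- do.call(rbind, pieces)
```

The frame is copied once at the end rather than once per iteration.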
On this note, if you need one frame per category, you can get that simply with:
df_split <- split(df, df$category)
and then operate on each category as its own frame by working on a specific element within the df_split named list (e.g., df_split[["shoplifting"]]).
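For illustration, a small sketch of splitting by category with base R's split(), again on invented data (the names by_category and n_per_category are hypothetical):

```r
# Invented stand-in for the crimes data
crimes <- data.frame(
  id = 1:5,
  category = c("shoplifting", "drugs", "shoplifting", "burglary", "drugs"),
  stringsAsFactors = FALSE
)

# named list with one data frame per category
by_category <- split(crimes, crimes$category)

names(by_category)             # the categories found in the data
by_category[["shoplifting"]]   # just the shoplifting rows

# operate on every per-category frame at once
n_per_category <- sapply(by_category, nrow)
```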
And lastly, depending on the analysis you're doing, it might still make sense to keep it all together. Both the dplyr and data.table dialects of R make doing calculations on data within groups very intuitive and efficient.
Try:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),]
Using a for loop in this instance will work, but when working in R you want to stick to vectorized operations if you can.
This operation subsets the crimes dataframe and selects rows where the category column is equal to shoplifting. It is not necessary to convert the category column into a factor - you can match the string with the == operator.
Note the comma at the end of the which(...) function, inside of the square brackets. The which function returns indices (row numbers) that meet the criteria. The comma after the function, with nothing following it, tells R that you want all of the columns. If you wanted to select only a few columns you could do:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c("id","month","street_name")]
OR you could call the columns based on their number (I don't have your data so I don't know the numbers...but if id, month, street_name are the first three columns, you could use 1, 2, 3).
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c(1,2,3)]
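One practical difference between which() and a bare logical index shows up when the category column contains missing values. A small sketch on invented data (whether the real crimes data contains any NA is an assumption):

```r
# Invented data with one missing category
crimes <- data.frame(
  id = 1:4,
  category = c("shoplifting", NA, "drugs", "shoplifting"),
  stringsAsFactors = FALSE
)

# which() returns only the positions where the comparison is TRUE
with_which <- crimes[which(crimes$category == "shoplifting"), ]

# a bare logical index keeps the NA comparison as an all-NA row
with_logical <- crimes[crimes$category == "shoplifting", ]

nrow(with_which)    # 2
nrow(with_logical)  # 3
```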
For a course at university where we learn how to do R, we have to filter a dataframe supplied (called crimes). The original dataframe has 8 columns.
I do not think I can supply the data set, since it is part of an assignment for school. But any advice would be really appreciated.
The requirements of the tasks are to use a loop and an if-statement, to filter one column ("category") and take only the rows with one specific level (out of 14) (named "drugs"). Then printing only three out of the eight columns of those rows into a new dataframe.
for (i in crimes$category) {
if (i == "drugs") {
drugs <- rbind(drugs, crimes[c(2,3,7)])
}
}
Now I know the problem is in the rbind function, since it now just duplicates all rows 160 times (there are 160 rows with the category "drugs"). But I do not know how to get a dataframe with 160 observations and only 3 variables.
Note that the assignment defeats the purpose of using R. But that said, use the for / if construct to get the row numbers containing the category value "drugs" then create the result df outside the loop:
keep <- integer()
for (i in seq_along(crimes$category)) {   # loop over row numbers, not values
  if (crimes$category[i] == "drugs") {
    keep <- c(keep, i)                    # remember the matching row number
  }
}
crimes2 <- crimes[keep, c(2,3,7)]
Note the base R no-loop solution would be:
crimes2 <- crimes[crimes$category == "drugs", c(2,3,7)]
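A quick way to check that the loop and the vectorized version agree is to run both on a small invented frame. The columns here are made up, and names are used instead of the positions 2, 3, 7 because the real layout is unknown:

```r
# Invented stand-in for the assignment data
crimes <- data.frame(
  id = 1:5,
  category = c("drugs", "burglary", "drugs", "robbery", "drugs"),
  month = "2020-01",
  street = paste("Street", 1:5),
  stringsAsFactors = FALSE
)

# loop + if: collect the row numbers of matching rows
keep <- integer()
for (i in seq_along(crimes$category)) {
  if (crimes$category[i] == "drugs") {
    keep <- c(keep, i)
  }
}
looped <- crimes[keep, c("id", "month", "street")]

# vectorized equivalent
vectorized <- crimes[crimes$category == "drugs", c("id", "month", "street")]

all.equal(looped, vectorized)
```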
I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
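The one-liner above keeps only columns that contain no empty strings at all. If the intent is to drop only columns that are entirely empty, as in the question's own per-set code, a sketch along these lines may be closer (drop_empty_cols is a hypothetical helper name):

```r
# The simplified example from the question
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", as.character(1:6), rep("", 4)),
                            c("v2", letters[1:6], rep("", 4))),
                      stringsAsFactors = FALSE)

# hypothetical helper: drop columns whose values are all ""
drop_empty_cols <- function(d) {
  d[, !sapply(d, function(col) all(col == "")), drop = FALSE]
}

cleaned <- lapply(split(myDF, myDF$V1), drop_empty_cols)
sapply(cleaned, ncol)
```

No seq_along is needed here, since lapply passes each element of the list to the function directly.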
I am currently working on a code which applies to various datasets from an experiment which looks at a wide range of variables which might not be present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains columns that are in the dataset being inputted and delete the rest. Here is an example of how I want to achieve this:-
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part " write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[,names(df)%in%possible]
As akrun wrote, use intersect(x,y) or
> x[x %in% y]
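Applied to the vectors from the question, and to a made-up empty template data frame (the template object is an assumption about how the "empty dataset" might look), this could be sketched as:

```r
x <- c("a", "b", "c", "d", "e", "f", "g")
y <- c("c", "f", "g")

kept    <- x[x %in% y]       # elements of x that are present in y
same    <- intersect(x, y)   # same result for vectors without duplicates
dropped <- x[!(x %in% y)]    # elements of x that are not in y

# hypothetical empty template with one column per possible variable
template <- as.data.frame(matrix(NA, nrow = 0, ncol = length(x),
                                 dimnames = list(NULL, x)))

# keep only the columns that occur in the incoming dataset's variables
trimmed <- template[, names(template) %in% y, drop = FALSE]
names(trimmed)
```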
My dataframe (m*n) has a few hundred columns. I need to compare each column with all other columns (contingency table), perform a chisq test, and save the results for each column in a different variable.
It's working for one column at a time, like:
s <- function(x) {
a <- table(x,data[,1])
b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multitesting problem), work with lists :
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As I show you in the last line, you can easily get all tests for a specific column, so there is no need to make a list for each column. That just takes longer and takes more space, but gives the same information. You can write a small convenience function to extract the data you need :
extract <- function(col,l){
l[grep(col,names(l))]
}
extract("^y$",my.results)
Which makes you can even loop over different column names of your dataframe and get a list of lists returned :
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
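A sketch of pulling out just the p-value and the test statistic, reusing the same kind of random example data. The seed and the suppressWarnings() call are additions for reproducibility and quieter output on small samples, not part of the original answer:

```r
set.seed(1)  # added for reproducibility
Data <- data.frame(
  x = sample(letters[1:3], 20, TRUE),
  y = sample(letters[1:3], 20, TRUE),
  z = sample(letters[1:3], 20, TRUE)
)

ids <- combn(names(Data), 2, simplify = FALSE)
my.results <- lapply(ids, function(z) {
  # small samples trigger an approximation warning, silenced here
  suppressWarnings(chisq.test(table(Data[, z])))
})
names(my.results) <- sapply(ids, paste, collapse = "-")

# keep only the numbers of interest instead of the whole test objects
p.values   <- sapply(my.results, function(res) res$p.value)
statistics <- sapply(my.results, function(res) unname(res$statistic))
```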
Fundamentally, you have a few problems here:
You're relying heavily on global arguments rather than local ones. This makes the double usage of "data" confusing.
Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
You're not extracting the one value you need from the chisq.test(). This means your result gets returned as a list.
You didn't provide some example data. So here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
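One possible shape of the fixed version is sketched below. It assumes the columns are categorical, so it uses invented categorical data rather than the runif matrix above. The names mydata, chisq_p and pvals are hypothetical:

```r
set.seed(42)  # added for reproducibility
mydata <- data.frame(
  a = sample(c("u", "v"), 40, TRUE),
  b = sample(c("u", "v"), 40, TRUE),
  c = sample(c("u", "v"), 40, TRUE),
  stringsAsFactors = FALSE
)

# data and both column indices are arguments; only the p-value is returned
chisq_p <- function(df, i, j) {
  suppressWarnings(chisq.test(table(df[[i]], df[[j]])))$p.value
}

n <- ncol(mydata)
pvals <- matrix(NA_real_, n, n,
                dimnames = list(names(mydata), names(mydata)))

# loop over every pair of distinct columns and store the result
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) pvals[i, j] <- chisq_p(mydata, i, j)
  }
}
pvals
```

Row i of pvals then holds the results for column i against all other columns, which covers the "one result set per column" requirement without creating separate variables.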