I currently have two tables, endingStock and totalUsage, and each row in each table has a date entry. I want to iterate over the rows of both tables, compare the dates, and, whenever the dates match, take the `unit value` column from both endingStock and totalUsage and append endingStock$`unit value` / totalUsage$`unit value` as a new row of a third table, ratioTable.
Using a bunch of different answers on this site, I've tried setting ratioTable to NULL and building up the table row by row, using the following loop:
for (i in 1:nrow(endingStocks)) {
  for (j in 1:nrow(useTotals)) {
    if (endingStocks[i,]$valuedate == useTotals[j,]$valuedate) {
      ratio <- endingStocks[i,]$`unit value` / useTotals[j,]$`unit value`
      newrow <- c(endingStocks[i,]$valuedate, ratio)
      ratioTable <- rbind(ratioTable, newrow)
    }
  }
}
However, this seems to be skipping values. Each data frame has over 200 entries with roughly matching dates, so the resulting ratioTable should have a similar number of rows, but it only has 24.
1) Is there a way to effectively do this using vectorized operations?
2) Are there any glaring faults with my code?
I'm guessing without a reproducible example, but I think using merge will do what you want:
ratioTable = merge(endingStocks, totalUsage, by="valuedate")
You can then tidy your new table as required.
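For example, one way to do that tidying (a sketch, not tested against your data, assuming both frames carry a `unit value` column, which merge will suffix as `unit value.x` and `unit value.y`):

# merge() suffixes duplicated column names with .x and .y by default
ratioTable$ratio <- ratioTable$`unit value.x` / ratioTable$`unit value.y`
# keep only the date and the ratio
ratioTable <- ratioTable[, c("valuedate", "ratio")]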
Related
First of all, I am using the ukpolice library in R and extracted data into a new data frame called crimes. Now I am running into a new problem: I am trying to extract certain data into a new, empty data frame called df.shoplifting. If the category of the crime is equal to "shoplifting", the id, month and street name need to be added to the new data frame. I need to use a loop and an if statement together.
EDIT:
Currently I have this working, but it lacks the if statement:
for (i in crimes$category) {
  shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
  names(shoplifting) <- c("ID", "Month", "Street_Name")
}
What I am trying to do:
for (i in crimes$category) {
  if (crimes$category == "shoplifting") {
    data1 <- subset(crimes, category == i, select = c(id, month, street_name))
  }
}
It does run and creates the new data frame data1, but the data it extracts is wrong and is not limited to rows with the shoplifting category.
I'll guess, and update if needed based on your question edits.
rbind works only on data.frame and matrix objects, not on vectors. If you want to extend a vector (N.B., that is not part of a frame or column/row of a matrix), you can merely extend it with c(somevec, newvals) ... but I think that this is not what you want here.
You are iterating through each value of crimes$category, but if one category matches, then you are appending all data within crimes. I suspect you mean to subset crimes when adding. We'll address this in the next bullet.
One cannot extend a single column of a multi-column frame in the absence of the others. A data.frame has a restriction that all columns must always have the same length, and extending one column violates that. (And doing all columns immediately-sequentially does not satisfy that restriction either.)
One way to work around this is to rbind a just-created data.frame:
# i = "shoplifting"
newframe <- subset(crimes, category == i, select = c(id, month, street_name))
names(newframe) <- c("ID", "Month", "Street_Name") # match df.shoplifting names
df.shoplifting <- rbind(df.shoplifting, newframe)
I don't have the data, but if crimes$category ever has repeats, you will re-add all of the same-category rows to df.shoplifting. This might be a problem with my assumptions, but is likely not what you really need.
If you really just need to do it once for a category, then do this without the need for a for loop:
df.shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
# optional
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
Iteratively adding rows to a frame is a bad idea: while it works okay for smaller datasets, as your data scales, the performance worsens. Why? Because each time you add rows to a data.frame, the entire frame is copied into a new object. It's generally better to form a list of frames and then concatenate them all later (c.f., https://stackoverflow.com/a/24376207/3358227).
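A rough sketch of that list-then-bind pattern, reusing the crimes columns from above (hypothetical, since I still don't have your data):

# collect small frames in a list inside the loop ...
pieces <- list()
for (categ in unique(crimes$category)) {
  if (categ == "shoplifting") {
    pieces[[categ]] <- subset(crimes, category == categ,
                              select = c(id, month, street_name))
  }
}
# ... and bind them once at the end, so the frame is copied far less often
df.shoplifting <- do.call(rbind, pieces)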
On this note, if you need one frame per category, you can get that simply with:
df_split <- split(df, df$category)
and then operate on each category as its own frame by working on a specific element within the df_split named list (e.g., df_split[["shoplifting"]]).
And lastly, depending on the analysis you're doing, it might still make sense to keep it all together. Both the dplyr and data.table dialects of R make doing calculations on data within groups very intuitive and efficient.
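For example, a grouped summary with dplyr (a minimal sketch; only the category column is taken from the data above):

library(dplyr)
# count crimes per category without splitting the frame apart
crimes %>%
  group_by(category) %>%
  summarise(n_crimes = n())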
Try:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),]
Using a for loop in this instance will work, but when working in R you want to stick to vectorized operations if you can.
This operation subsets the crimes dataframe and selects rows where the category column is equal to shoplifting. It is not necessary to convert the category column into a factor - you can match the string with the == operator.
Note the comma at the end of the which(...) function, inside of the square brackets. The which function returns indices (row numbers) that meet the criteria. The comma after the function tells R that you want all of the rows. If you wanted to select only a few rows you could do:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'), c("id", "month", "street_name")]
Or you could call the columns by their number (I don't have your data so I don't know the numbers, but if id, month and street_name are the first three columns, you could use 1, 2, 3):
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c(1,2,3)]
For a course at university where we learn how to do R, we have to filter a dataframe supplied (called crimes). The original dataframe has 8 columns.
I do not think I can supply the data set, since it is part of an assignment for school, but any advice would be really appreciated.
The requirement of the task is to use a loop and an if-statement to filter on one column ("category") and keep only the rows with one specific level (out of 14), named "drugs", then to put only three of the eight columns of those rows into a new data frame.
for (i in crimes$category) {
  if (i == "drugs") {
    drugs <- rbind(drugs, crimes[c(2,3,7)])
  }
}
Now I know the problem is in the rbind call, since it just duplicates all rows 160 times (there are 160 rows with the category "drugs"). But I do not know how to get a data frame with 160 observations and only 3 variables.
Note that the assignment defeats the purpose of using R. But that said, use the for / if construct to collect the row numbers whose category value is "drugs", then create the result data frame outside the loop:
keep <- integer()
for (i in seq_along(crimes$category)) {
  if (crimes$category[i] == "drugs") {
    keep <- c(keep, i)  # store the row number, not the value
  }
}
crimes2 <- crimes[keep, c(2,3,7)]
Note the base R no-loop solution would be:
crimes2 <- crimes[crimes$category == "drugs", c(2,3,7)]
I have a dataframe (df) with a column of latitudes (Lat), and I need to match up the corresponding longitude value (based on relationships in another dataset). The new column name is 'Long_matched'.
Here, I am trying to write a new value into the column 'Long_matched' at the rows whose latitudes fall between -33.9238 and -33.9236. The data in 'Lat' has many more decimal places (e.g. -33.9238026666667, -33.9236026666667, etc.). As I will be applying this code to multiple datasets over the same geographical location (so the long decimals will vary slightly), I want to assign a longitude value to any latitude that falls within a 0.0002 degree range.
Some attempts of code I have tried include:
df$Long_matched <- ifelse(df$Lat< -33.9236 & df$Lat> -33.9238, 151.2279 , "N/A")
or
df$Long_matched[df$Lat< -33.9236 & df$Lat> -33.9238] <- 151.2279
I think I need to use a for loop to loop through the rows together with an if statement, but I'm struggling to figure this out - any help would be appreciated!
Resulting output should look something like this:
Lat Long_matched
-33.9238026666667 151.2279
-33.9236026666667 (new long value will go here)
Everything said in the comments applies, but here is a trick you can try:
In the following code, you will need to replace text with numbers.
Latitude_breaks <- seq(min_latitude, max_latitude, 0.0002)  # replace `min_latitude` and `max_latitude` with numbers
Longitude_values <- seq(first, last, increment)             # replace `first`, `last` and `increment` with numbers
df <- within(df, {
  # make a categorical version of `Lat`
  Lat_cat <- cut(Lat, Latitude_breaks)
  Long_matched <- Longitude_values[Lat_cat]
})
A few notes:
the values between min_latitude and min_latitude + 0.0002 will be assigned to the Longitude value marked first.
The length of Latitude_breaks should be one more than the length of Longitude_values.
Values of Lat outside of Latitude_breaks will become NA.
This works by exploiting a nice feature of factors - they are stored as integers, so we can use them to index another vector - in this case, Longitude_values.
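For illustration, here is the same trick with made-up numbers (the bounds, break width, and longitude values below are placeholders, not values from your data):

Latitude_breaks  <- seq(-33.9240, -33.9230, by = 0.0002)                # 6 break points
Longitude_values <- c(151.2279, 151.2281, 151.2283, 151.2285, 151.2287) # 5 values, one per interval
df <- data.frame(Lat = c(-33.9238026666667, -33.9236026666667))
df <- within(df, {
  Lat_cat      <- cut(Lat, Latitude_breaks)   # which 0.0002-degree bin each Lat falls in
  Long_matched <- Longitude_values[Lat_cat]   # the factor's integer codes index the longitude vector
})
df
# -33.9238026666667 falls in the first bin, so its Long_matched is 151.2279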
I have two data frames: 1) df_events and 2) df_payments. Both contain user_id and timestamp columns, and the payments table also stores payment amounts. A new column, revenue, needs to be added to the events table: the sum of the amounts for a particular user up until the time of the event.
Here's my attempt at accomplishing that; consider it pseudo-code:
library(dplyr)  # filter() below is assumed to be dplyr's

getRevenue <- function(userid, event_time) {
  temp <- filter(df_payments, user_id == userid & payment_time <= event_time)
  revenue <- sum(temp$amount)
  return(revenue)
}

for (row in 1:nrow(df_events)) {
  df_events[row, "revenue"] <- getRevenue(df_events[row, "user_id"],
                                          df_events[row, "event_time"])
}
I've been reading up on the apply family, but what throws me off is the need to look up two parameters simultaneously and do different things with them. The usual apply examples demonstrate a single parameter, as in FUN = function(x) myfun(x), but how do I pass two columns from each row and treat them differently, i.e. perform an == comparison with one and <= with the other?
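One option is mapply(), which walks several vectors in parallel and hands the i-th element of each to the function. A minimal sketch reusing the getRevenue() helper above (untested, and it keeps the row-by-row lookup, just without the explicit loop):

# each call receives one (user_id, event_time) pair, exactly like the for loop
df_events$revenue <- mapply(getRevenue,
                            df_events$user_id,
                            df_events$event_time)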
I wrote the following code to extract multiple datasets out of one large dataset based on the column Time.
for (i in 1:nrow(position)) {
  assign(paste("position.", i, sep = ""),
         subset(dataset, Time >= position[i, 1] & Time <= position[i, 2]))
}
(position is a table whose first column [,1] contains the start times and whose second column [,2] contains the stop times)
The outputs are subsets of my original dataset and look like:
position.1
position.2
position.3
....
Is there a possibility to add an extra column to each of the new datasets (position.1, position.2, ...) which identifies them by a number?
E.g. position.1 has an extra column with value 1, position.2 has an extra column with value 2, and so on.
I need those numbers to identify the datasets (position.1, position.2, ...) after I rbind them back into one dataset in a last step.
Since you don't provide example data, this is untested, but should work for you:
dflist <- lapply(1:nrow(position), function(x) {
  within(dataset[dataset$Time >= position[x, 1] & dataset$Time <= position[x, 2], ],
         val <- x)  # add a column recording which position range this subset came from
})
do.call(rbind, dflist)
Basically, you never want to take the strategy you propose of assigning multiple numbered objects to the global environment. It is much easier to store all of the subsets in a list and then bind them back together using do.call(rbind, dflist). This is more efficient, produces less clutter in your workspace, and is a more "functional" style of programming.
In addition to Thomas's recommendation to avoid side effects, you might want to take advantage of existing packages that detect overlaps. The IRanges package in Bioconductor can detect overlaps between one set of ranges (position) and another set of ranges or positions (dataset$Time). This gets you the matches between the time points and the ranges:
r <- IRanges(position[[1L]], position[[2L]])
hits <- findOverlaps(dataset$Time, r)
Now, you want to extract a subset of the dataset that overlaps each range in position. We can group the query (Time) indices by the subject (position) indices and extract a list from the dataset using that grouping:
dataset <- DataFrame(dataset)
l <- extractList(dataset, split(queryHits(hits), subjectHits(hits)))
To get the final answer, we need to combine the list elements row-wise, while adding a column that denotes their group membership:
ans <- stack(l)