Selecting dates in a data.table for a new column in R
I have a data table with 65 variables. I want to create a new column, Semester: all IDs dated before 2015-03-31 should be allocated to semester 1, and all others to semester 2.
students <- data.table(studid = 1:6,
                       FAC = c("IT","SCIENCE","LAW","IT","COMMERCE","COMMERCE"),
                       dates = c("2010-12-01","2010-03-01","2010-03-01",
                                 "2010-05-20","2010-03-01","2010-03-31"))
I have set the date class:
students$dates <- as.Date(students$dates)
I have then specified the new column:
students[, Semester := 2]
Then I have tried:
students$Semester[students$dates < 2015-05-31]<-1
But this does not work. Any advice?
First of all, I would recommend starting to use proper data.table syntax. All of these $, <-, etc. are base R idioms which don't take advantage of data.table's capabilities. Please read the vignettes at this link.
In other words, converting to a date, for example, is done as follows (no need for <- or $):
students[, dates := as.IDate(dates)]
This will update your data by reference.
Second of all, when you just type 2015-05-31, you are basically writing an arithmetic expression: 2015 - 05 - 31 = 1979. Paste it into the console and see what you get. In other words, you need to quote "2015-05-31" so R knows it's a string (which will be coerced to the Date class when it is compared with <).
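That is, typed at the R console:

2015-05-31
# [1] 1979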
Finally, here's the solution using data.table syntax
students[dates < "2015-05-31", Semester := 1]
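Putting the pieces together with the sample data from the question (a minimal sketch; note that with these particular dates every row lands in semester 1):

library(data.table)
students <- data.table(studid = 1:6,
                       FAC = c("IT","SCIENCE","LAW","IT","COMMERCE","COMMERCE"),
                       dates = c("2010-12-01","2010-03-01","2010-03-01",
                                 "2010-05-20","2010-03-01","2010-03-31"))
students[, dates := as.IDate(dates)]   # convert by reference
students[, Semester := 2]              # default value
students[dates < "2015-05-31", Semester := 1]
# All six sample dates fall in 2010, so every row gets Semester = 1 here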
Related
Iterate through and conditionally append string values in a Pandas dataframe
I've got a dataframe of research participants whose IDs are stored in the following format: "0000.000", where the first four digits are the family ID number and the final three digits are the individual's index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.

As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly. Those IDs that end in .000 are imported as "0000" rather than "0000.000"; all others are imported correctly.

I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix. If I were using R, I could do it like this:

df %>% mutate(StudyID = ifelse(length(StudyID) < 5,
                               paste(StudyID, ".000", sep=""),
                               StudyID))

I've found a Python solution (below), but it's pretty janky:

row = 0
for i in df["StudyID"]:
    if len(i) < 5:
        df.iloc[row, 3] = i + ".000"
    else:
        df.iloc[row, 3] = i
    row += 1

I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time. For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration; I only want the value currently being evaluated to change:

[i + ".000" if len(i) < 5 else i for i in df["StudyID"]]

Is this possible?
As you said, your code does the trick. One other way of doing what you want that I could think of is the following:

# Start by creating a mask that gives you the indexes you want to change
mask = [len(i) < 5 for i in df.StudyID]

# Change the value of the dataframe on the mask
df.StudyID.iloc[mask] += ".000"
I think by length(StudyID) you meant nchar(StudyID), as @akrun pointed out. You can do it the dplyr way in Python using datar:

>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
...     StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
    StudyID
   <object>
0      0000
1      0001
2  0000.000
3  0001.001
>>>
>>> df >> mutate(StudyID=if_else(
...     nchar(f.StudyID) < 5,
...     paste(f.StudyID, ".000", sep=""),
...     f.StudyID
... ))
    StudyID
   <object>
0  0000.000
1  0001.000
2  0000.000
3  0001.001

Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes, so I ended up defining a function to solve the problem so that I could apply it to each one. I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewed, so I stuck with a plain old for-loop.

def create_multi_index(data, col_to_split, sep = "."):
    """
    This function loops through the original ID column and splits it
    into multiple parts (multi-IDs) on the defined separator.

    By default, the function assumes the unique ID is formatted like
    a decimal number. The new multi-IDs are appended into a new list.
    If the original ID was formatted like an integer rather than a
    decimal, the function assumes the latter half of the ID to be ".000".
    """
    # Take a copy of the dataframe to modify
    new_df = data

    # generate two new lists to store the new multi-index
    Family_ID = []
    Family_Index = []

    # iterate through the IDs, split and allocate the pieces
    # to the appropriate list
    for i in new_df[col_to_split]:
        i = i.split(sep)
        Family_ID.append(i[0])
        if len(i) == 1:
            Family_Index.append("000")
        else:
            Family_Index.append(i[1])

    # Modify and return the dataframe including the new multi-index
    return new_df.assign(Family_ID = Family_ID, Family_Index = Family_Index)

This returns a duplicate dataframe with a new column for each part of the multi-ID. When joining dataframes with this form of ID, as long as both dataframes have the multi-index in the same format, these columns can be used with pd.merge as follows:

pd.merge(df1, df2, how = "inner", on = ["Family_ID", "Family_Index"])
R - Assign the mean of a column sub-sector to each row of that sub-sector
I am trying to create a column which holds the mean of a variable computed within subsectors of my data set. In this case, the mean is the crime rate of each state calculated from county observations, which is then assigned to each county according to the state it is located in. Here is the code I wrote.

Create the new column:

Data.Final$state_mean <- 0

Then calculate and assign the mean:

for (j in range[1:3136]) {
  state <- Data.Final[j, "state"]
  Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
                                      which(Data.Final[, "state"] == state))
}

Here is the resulting error:

Error in range[1:3137] : object of type 'builtin' is not subsettable

It would be very much appreciated if you could take a few minutes to help a beginner out.
You've got a few problems:

- range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136; just use for (j in 1:3136) instead.
- Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to wrap it in backticks, Data.Final$`violent_crime_2009-2014`, or use it in quotes with [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"].

Also, your code is very inefficient: it re-calculates the mean on every single iteration. Try having a look at the Mean by Group R-FAQ; there are many faster and easier methods to get grouped means. Without using extra packages, you could do

Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
                            Data.Final$state,
                            FUN = mean)

For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
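For instance, the grouped mean is a one-liner in data.table (a sketch, assuming the column names from the question):

library(data.table)
setDT(Data.Final)
# compute the mean within each state and assign it to every row
# of that state, by reference
Data.Final[, state_mean := mean(`violent_crime_2009-2014`), by = state]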
Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon, if not before I manage to post):

# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)

# Take the averages with the by var:
mn <- with(InsectSprays,
           aggregate(x = list(mean = count), by = list(spray = spray), FUN = mean))

# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays, mn, by = "spray", all = TRUE)

Since you mentioned you're a beginner, I'll just mention that whenever you can, avoid looping in R; vectorize your operations instead. The nice thing about using aggregate and merge is that you don't have to worry about your mapping going wrong because of an index shift while looping. Cheers!
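As predicted above, here is the tidyverse equivalent (a sketch using dplyr's group_by and mutate):

library(dplyr)
InsectSprays <- InsectSprays %>%
  group_by(spray) %>%
  mutate(mean = mean(count)) %>%   # per-group mean, broadcast to every row
  ungroup()

mutate() keeps all rows, so no separate merge step is needed.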
LHS:RHS vs functional in data.table
Why does the functional ':=' not aggregate unique rows using 'by', yet LHS:RHS does aggregate using 'by'? Below is a .csv file of 20 rows of data with 58 variables (a simple copy, paste, delim = .csv works; I am still trying to find the best way to post sample data to SO). The 2 variants of my code are:

prodMatrix <- so.sample[, ':=' (Count = .N), by = eval(names(so.sample)[2:28])]
# this version does not aggregate the rowID using by

prodMatrix <- so.sample[, (Count = .N), by = eval(names(so.sample)[2:28])]
# this version does aggregate the rowID using by

"CID","NetIncome_length_Auto Advantage","NetIncome_length_Certificates","NetIncome_length_Comm. Share Draft","NetIncome_length_Escrow Shares","NetIncome_length_HE Fixed","NetIncome_length_HE Variable","NetIncome_length_Holiday Club","NetIncome_length_IRA Certificates","NetIncome_length_IRA Shares","NetIncome_length_Indirect Balloon","NetIncome_length_Indirect New","NetIncome_length_Indirect RV","NetIncome_length_Indirect Used","NetIncome_length_Loanline/CR","NetIncome_length_New Auto","NetIncome_length_Non-Owner","NetIncome_length_Personal","NetIncome_length_Preferred Plus Shares","NetIncome_length_Preferred Shares","NetIncome_length_RV","NetIncome_length_Regular Shares","NetIncome_length_S/L Fixed","NetIncome_length_S/L Variable","NetIncome_length_SBA","NetIncome_length_Share Draft","NetIncome_length_Share/CD Secured","NetIncome_length_Used Auto","NetIncome_sum_Auto Advantage","NetIncome_sum_Certificates","NetIncome_sum_Comm. Share Draft","NetIncome_sum_Escrow Shares","NetIncome_sum_HE Fixed","NetIncome_sum_HE Variable","NetIncome_sum_Holiday Club","NetIncome_sum_IRA Certificates","NetIncome_sum_IRA Shares","NetIncome_sum_Indirect Balloon","NetIncome_sum_Indirect New","NetIncome_sum_Indirect RV","NetIncome_sum_Indirect Used","NetIncome_sum_Loanline/CR","NetIncome_sum_New Auto","NetIncome_sum_Non-Owner","NetIncome_sum_Personal","NetIncome_sum_Preferred Plus Shares","NetIncome_sum_Preferred Shares","NetIncome_sum_RV","NetIncome_sum_Regular Shares","NetIncome_sum_S/L Fixed","NetIncome_sum_S/L Variable","NetIncome_sum_SBA","NetIncome_sum_Share Draft","NetIncome_sum_Share/CD Secured","NetIncome_sum_Used Auto","totNI","Count","totalNI"
93,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,212.97,0,0,0,-71.36,0,0,0,49.01,0,0,67.42,6,404.52
114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-285.44,0,0,0,49.01,0,0,-221.89,90,-19970.1
1112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,-101.55,0,-71.36,0,0,0,98.02,0,0,-14.66,28,-410.48
5366,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
6078,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,7,0,0,0,1,0,0,0,0,0,0,0,0,-17.44,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-499.52,0,0,0,49.01,0,0,-453.41,3,-1360.23
11684,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
47358,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,-14.43,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-85.79,3194,-274013.26
193761,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-71.36,0,0,0,49.01,0,0,-123.9,9973,-1235654.7
232530,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
604897,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1021309,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1023633,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1029726,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,37.88,8688,329101.44
1040005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1040092,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1064453,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,212.97,0,0,0,-142.72,0,0,0,0,0,0,84.79,49,4154.71
1067508,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-194.56,4162,-809758.72
1080303,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1181005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-142.72,0,0,0,98.02,0,0,-146.25,614,-89797.5
1200484,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-285.44,0,0,0,0,0,0,-386.99,50,-19349.5
Because := operates by reference. That means it does not make an in-memory copy of your dataset; it updates it in place. An aggregation of your dataset, by contrast, is a copy of its original, unaggregated form. You can read more about this in the Reference semantics vignette. This is a design concept in data.table: := is used to update by reference, while the other forms (.(), list(), or a direct expression) are used to query data, and querying is not a by-reference operation. A by-reference operation cannot aggregate rows; it can only calculate aggregates and put them into the dataset in place. A query can aggregate the dataset because the query result is not the same object in memory as the original data.table.
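A minimal illustration of the difference, using toy data rather than the posted sample:

library(data.table)
dt <- data.table(grp = c("a", "a", "b"), x = 1:3)

# := updates dt in place: a Count column is added, the rows stay as they are
dt[, Count := .N, by = grp]
#    grp x Count
# 1:   a 1     2
# 2:   a 2     2
# 3:   b 3     1

# .() is a query: it returns a new, aggregated data.table and leaves dt alone
dt[, .(Count = .N), by = grp]
#    grp Count
# 1:   a     2
# 2:   b     1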
Update data.table column changes data type
I am testing a small-scale scenario before rolling it out in a larger production environment, and am experiencing a strange occurrence. I have 2 data sets:

dtL <- data.table(URN = c(1,2,3,4,5),
                  DonorType = c("Cash","RG","Emergency","Emergency","Cash"))
dtL[, c("EmergVal","EmergDate") := list(as.numeric(NA), as.Date(NA))]
setkey(dtL, URN)

dtR <- data.table(URN = c(1,1,1,2,3,3,3,4,4,4,4,5),
                  class = c(5,5,5,1,5,40,40,5,40,5,40,5),
                  xx = c(25,50,25,10,100,20,25,20,40,35,20,25),
                  xdate = as.Date(c("2013-01-01","2013-06-05","2014-05-27","2014-10-14",
                                    "2014-06-09","2014-04-07","2014-10-16","2014-07-16",
                                    "2014-10-21","2014-10-22","2014-09-18","2013-12-19")))
setkey(dtR, URN)

I want to update dtL where the DonorType is equal to "Emergency", but only for a subset of records from dtR. I have seen Update subset of data.table based on join and have used that as a foundation for my solution:

dtL[dtR[class == 40, list(maxxx = max(xx)), by = URN],
    EmergVal := ifelse(DonorType == "Emergency", i.maxxx, as.numeric(NA))]

dtL[dtR[class == 40, list(maxdate = max(xdate)), by = URN],
    EmergDate := ifelse(DonorType == "Emergency", as.Date(i.maxdate), as.Date(NA)),
    nomatch = 0]

I don't get any errors; however, when I look at the data now in dtL, the data type of EmergDate has changed to num rather than what it originally was (i.e. Date). So, three questions:

1. Why has it changed the data type, especially when EmergDate is a Date when first created in dtL and I tell it to produce a Date in my ifelse statement?
2. How do I get it to keep the Date type when I assign it, or will I have to do some post-assignment conversion/casting?
3. Is there a clean way I could do my assignment of EmergVal and EmergDate in a single statement, given that I don't have a field DonorType in dtR and I don't want to add it (so I can't use a multiple key for the join)?
Replacing for-loops in R
I am trying to stop using for-loops when I code, but I have a bit of a problem representing a simple operation. Let's say I am trying to do simple nearest-neighbour estimation on a dataset for a company that owns several restaurants. I have three features: City, Store, Month, and one target, Sales. City, Store and Month are all represented by numbers: City takes values between 1-100, Store between 1-50 and Month between 1-12. Now I want to replace this for-loop with an apply function:

for (c in 1:100) {
  for (s in 1:50) {
    for (m in 1:12) {
      dat1$Sales[dat1$City == c & dat1$Store == s & dat1$Month == m & is.na(dat1$Sales)] <-
        mean(dat1$Sales[dat1$City == c & dat1$Store == s & dat1$Month == m & !is.na(dat1$Sales)])
    }
  }
}

What is the complexity of this apply function? Many thanks!
Try using aggregate. It has a formula-like interface that makes it easy to get the results of a function applied to parts of a data.frame. Then just match the result back to the rows of dat1 that need it:

TempOut <- aggregate(Sales ~ City + Store + Month, FUN = mean, data = dat1)

# Find the rows with missing Sales, then look up the matching
# City/Store/Month combination in the aggregated results
missing <- is.na(dat1$Sales)
idx <- match(paste(dat1$City[missing], dat1$Store[missing], dat1$Month[missing]),
             paste(TempOut$City, TempOut$Store, TempOut$Month))
dat1$Sales[missing] <- TempOut$Sales[idx]

You could combine the creation of TempOut and the assignment to dat1$Sales into one line, but that would make this even harder to read. I don't have your data so I can't test this, but it should get you on the right track, even if there is a typo in there.
Here's a data.table way:

require(data.table)
setDT(dat1)
dat1[, Sales := {
  m <- mean(Sales, na.rm = TRUE)
  replace(Sales, is.na(Sales), m)
}, by = .(City, Store, Month)]

It would be nice to have something like Sales[is.na(Sales)] := ..., but this is just a feature request right now. Here is a similar question.
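For completeness, the same fill-with-group-mean idea works in base R with ave(), so no loop or extra package is needed (a sketch, untested against the OP's data):

# Replace NA Sales with the mean of the non-NA Sales
# in the same City/Store/Month group
dat1$Sales <- ave(dat1$Sales, dat1$City, dat1$Store, dat1$Month,
                  FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))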