sorry for the ugly code, but I'm not sure exactly what's going wrong
for (i in 1:1)
tab_sector[1:48,i] <-
tapply(get(paste("employee",1997-1+i, "[birth<=(1997-1+i)]",sep="")),
ordered(sic2digit[birth<=(1997-1+i)],levels=tab_sector_list))
# Error in get(paste("employee", 1997 - 1 + i,
# "[birth<=(1997-1+i)]", : object 'employee97[birth<=(1997-1+i)]' not found
but the variable is there:
head(employee97[birth<=(1997-1+i)])
# [1] 1 2 2 1 3 4
a simpler version where "employee" is not conditioned by "birth" works
It would help if you told us what you are trying to accomplish.
In your code, get() is looking for a variable whose literal name is "employee97[birth<=(1997-1+i)]"; the code that works finds a variable named "employee1997" and then subsets it. Those are very different things: get() does not do subsetting.
Part of what you are trying to do is FAQ 7.21; the most important part is the end, where it suggests storing your data in lists to make access easier.
You can't get an indexed element, e.g. get("x[i]") fails: you need get("x")[i].
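A minimal illustration of that rule:

```r
x <- c(10, 20, 30)
print(get("x")[2])   # subset the result of get(), not the name: [1] 20
# get("x[2]") would throw: object 'x[2]' not found
```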
Your code is almost too messy to see what's going on, but here is an attempt at a translation:
for (i in 1:1) {
  ind <- 1997 - 1 + i
  v1 <- get(paste0("employee", ind))
  tab_sector[1:48, i] <- tapply(v1[birth <= ind],
                                ordered(sic2digit[birth <= ind],
                                        levels = tab_sector_list))
}
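As the FAQ suggests, storing the yearly vectors in a list avoids get() entirely. A minimal sketch with toy values standing in for the poster's employee and birth data:

```r
# Toy data: list elements named by year replace employee1997, employee1998, ...
employees <- list("1997" = c(1, 2, 2, 1, 3, 4),
                  "1998" = c(2, 2, 3, 1, 1, 4))
birth <- c(1995, 1996, 1998, 1990, 1997, 1993)

for (yr in names(employees)) {
  # [[ indexes by computed name; no get() or paste() of subscripts needed
  sel <- employees[[yr]][birth <= as.numeric(yr)]
  print(sel)
}
```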
Related
I have come across a behaviour of R that I can't quite get my head around and after a while browsing older threads I am still not sure what it is that I am not getting right.
Here is a minimal example:
ls <- list(c(1,2), c(3,4))
names(ls) <- c("one", "two")
This creates a list with the following structure:
$one
[1] 1 2

$two
[1] 3 4
I can access an element as:
ls$one
which returns:
[1] 1 2
But if I try with a loop, e.g.:
for (i in names(ls)) {
print(ls$i)
}
It returns
NULL
NULL
Does someone know what the problem is? Is some slight modification needed, or does R prefer a fundamentally different approach to such problems?
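The cause is that `$` takes its index literally: `ls$i` looks for an element actually named "i" and returns NULL, whereas `[[` evaluates its argument. A minimal sketch (the list is renamed to avoid masking base::ls):

```r
lst <- list(one = c(1, 2), two = c(3, 4))
for (i in names(lst)) {
  print(lst[[i]])   # [[ evaluates i; lst$i would look for an element named "i"
}
# prints [1] 1 2 then [1] 3 4
```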
I'm very new to R, and I would like to know the best way to refer to different columns inside a for loop.
My code goes like this:
Variables <- c("Var1","Var2","Var3","Var4","Var5","Var6","Var7")
Years <- c(2015,2016,2017,2018)
for (Year in Years) {
for (Var in Variables) {
TT = auc(data[data$Def_Year==Year,]$Good_Bad,
data[data$Def_Year==Year,]$Var)
print (TT)
}
}
I'm trying to calculate the AUC (area under the ROC curve) for each variable in each year, in order to check the stability of the credit scoring model's performance.
The thing is, R does not understand the $Var part. In Excel I sometimes use & to overcome such obstacles. I would love to hear your recommendations.
Hi, you could do something like this; see my sample code below:
df <- data.frame(v1 = c(1,2,3), v2 = c(4,5,6))
variables <- c("v1", "v2")
for (var in variables) {
  print(df[, var])
}
Output:
[1] 1 2 3
[1] 4 5 6
I have not solved your code directly, as on SO it is advised to give general guidance towards a solution rather than solve the task fully. I would suggest you go through this to better understand subsetting in R: https://stats.idre.ucla.edu/r/modules/subsetting-data/
Also see https://cran.r-project.org/doc/manuals/R-lang.html#Indexing to understand the indexing in R.
From above:
The form using $ applies to recursive objects such as lists and pairlists. It allows only a literal character string or a symbol as the index. That is, the index is not computable: for cases where you need to evaluate an expression to find the index, use x[[expr]]. Applying $ to a non-recursive object is an error.
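Applied to the loop above, the fix is to index with `[[` (or `[, var]`) so the column name is computed rather than taken literally. A toy sketch with made-up data (`Def_Year`, `Var1`, `Var2` stand in for the real columns; the auc() call is omitted since it comes from an external package):

```r
df <- data.frame(Def_Year = c(2015, 2015, 2016),
                 Var1 = c(1, 2, 3),
                 Var2 = c(5, 6, 7))
for (Var in c("Var1", "Var2")) {
  yr_rows <- df[df$Def_Year == 2015, ]
  print(yr_rows[[Var]])   # yr_rows$Var would return NULL
}
# prints [1] 1 2 then [1] 5 6
```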
I am rather new to R and am struggling with a specific issue. I need to iterate over a data frame with one column returned from a SQL database, so that I can ultimately issue additional SQL queries using the values in that column. I need help understanding how to do this.
Here is what I have
> dt
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
When I try to access it with dt[1], it prints the entire column just as above:
> dt[1]
Col
1 5D2D3F03-286E-4643-8F5B-10565608E5F8
2 582771BE-811E-4E45-B770-42A98EB5D7FB
3 4EB4D553-C680-4576-A854-54ED817226B0
4 80D53D5D-80D1-4A60-BD86-C85F6D53390D
5 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
Accessing it with dt[1,] brings additional unwanted information:
> a<-dt[1,]
> a
[1] 5D2D3F03-286E-4643-8F5B-10565608E5F8
5 Levels: 4EB4D553-C680-4576-A854-54ED817226B0 ... 9EF6CABF-0A4F-4FA9-9FD9-132589CAAC31
I need to isolate just the '5D2D3F03-286E-4643-8F5B-10565608E5F8' information and not the '5 levels......'.
I am sure this is simple, I just can't find it. Any help is appreciated!
thanks!
There are two issues you need to address. One is that you want character data, not a factor variable (a factor is essentially a category variable). The other is that you want a simple vector of the values, not a data.frame.
1) To get the first column as a vector, use double-brackets or the $ notation:
a <- dt[[1]]
a <- dt[['Col']]
a <- dt$Col
Your notation dt[1,] does actually return the column as a vector too, relying on the somewhat obscure fact that the [ method for data.frame objects silently "drops" its result to a vector in the two-index form dt[i,j], but not in the one-index form dt[i]:
When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list. In this usage a drop argument is ignored, with a warning.
Think of "dropping" like unboxing the data - instead of getting a data.frame with a single column, you're just getting the column data itself.
2) To convert to character data, use one of the suggestions in the comments from @akrun or @Vlo:
a <- as.character(dt[[1]])
a <- as.character(dt[['Col']])
a <- as.character(dt$Col)
or use the API of whatever you're using to make the SQL query (or to read in its results) so it doesn't convert the strings to factors in the first place.
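A small demonstration of the difference, with `stringsAsFactors = TRUE` forced explicitly (R >= 4.0 defaults to character) and shortened stand-in values:

```r
# Pre-R-4.0 default behaviour: character columns become factors
dt <- data.frame(Col = c("5D2D3F03", "582771BE"),
                 stringsAsFactors = TRUE)
print(dt[1, ])                  # a factor element: prints the levels too
print(as.character(dt$Col)[1])  # a plain string: [1] "5D2D3F03"
```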
I am a big fan and massive user of data.tables in R. I really use them for a lot of code but have recently encountered a strange bug:
I have a huge data.table with multiple columns, example:
x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
if I select
dataDT[x=='1']
I end up getting
x y
1: 1 a
whereas
dataDT[(x=='1')]
gives me
x y
1: 1 a
2: 1 b
3: 1 c
Any ideas? x and y are factors, and the data.table is keyed by x via setkey.
ADDITIONAL INFOS AND CODE:
I actually fixed this issue, but in a way that is neither clear nor intuitive.
My code is structured as follows: I have a function called from my main code where I have to introduce a column in the data.table.
I have previously used the following notation
dataT[,nC:=oC,]
to do the deed.
I have instead found that creating the new column by using
dataT$nC <- dataT$oC
instead fixes the bug completely.
I tried to replicate the exact same bug in a simpler example but cannot, possibly because of dependencies related to the size and structure of my data.table, as well as the specific functions I am running on it.
With that said, I have a working example that shows that when you insert a column using the dataT[,nC:=oC,] notation, it acts as if the table were passed by reference to the function rather than by value.
Also, interestingly enough, while performing
dataDT[x=='1']
vs
dataDT[(x=='1')]
shows the same result, the latter is about 10 times slower, which I had noticed before. I hope this code can shed some light.
rm(list = ls())
library(data.table)

superParF <- function(dtInput) {
  dtInputP <- dtInput[a == 1]
  dtInputN <- dtInput[a == 2]
  outDT <- rbind(dtInputP[, sum(y), by = 'x'],
                 dtInputN[, sum(y), by = 'x'])
  return(outDT)
}

superFunction <- function(dtInput) {
  # create new column (by reference)
  dtInput[, z := y]
  # run function on each level of x
  outDT <- rbindlist(lapply(unique(dtInput$x),
                            function(i) superParF(dtInput[x == i])))
  return(outDT)
}

inputDT <- data.table(x = c(rep(1, 100000),
                            rep(2, 100000),
                            rep(3, 100000),
                            rep(4, 100000),
                            rep(5, 100000)),
                      y = rep(1:100000, 5))
inputDT$x <- as.factor(inputDT$x)
inputDT$y <- as.numeric(inputDT$y)
inputDT <- rbind(inputDT, inputDT)
inputDT$a <- c(rep(1, 500000), rep(2, 500000))
setkey(inputDT, x)

# first observation -> the two searches do not run with the same performance
a <- system.time(inputDT[x == '1'])
b <- system.time(inputDT[(x == '1')])
print(a)
print(b)

out <- superFunction(inputDT)

a <- system.time(inputDT[x == '1'])
b <- system.time(inputDT[(x == '1')])
print(a)
print(b)
inputDT
I asked in comments to provide the version number and to follow the guidelines on the Support page, which contains:
Read and search the README.md. Is there a bug fix or a new feature related to your issue? Probably we were aware of the issue or someone else reported it and we have already fixed the issue in the current development version.
So, searching the README.md for the string "index" using Ctrl-F in the browser yields:

21. Auto indexing handles logical subset of factor column using numeric value properly, #1361. Thanks @mplatzer.
26. Auto indexing returns order of subset properly when input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.
Those are fixed in v1.9.7, easily installed with the one command detailed on the Installation page.
The first one (item 21) looks suspiciously close to your issue. So please do try v1.9.7 as requested on the Support page in point 4.
We ask for you state the version number up front to save time because we want to ensure you are using at least v1.9.6 on CRAN and not v1.9.4 which had this problem :
DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value)==length(column) then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e., optimization of == and %in%) may still be turned off with options(datatable.auto.index=FALSE).
So which version are you running please and have you tried v1.9.7 since it looks like it's worth a try?
Using the dT[,Column:=Value] notation seems to cause the same bug in another post as well:
data.table not recognising logical in filter
Replacing dT[,Column:=Value] with dT$Column <- Value fixes both my bug and that post's bug.
@Matt Dowle: the post I am linking has much more succinct code than I have, and the bug is the same! You may find it of great help in fixing this issue!
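For what it's worth, the behavioural difference between the two notations is expected: := adds the column by reference, modifying the caller's table even inside a function, while $<- follows R's usual copy-on-modify semantics and leaves the caller's table untouched. A minimal sketch (requires the data.table package):

```r
library(data.table)

add_ref  <- function(DT) DT[, z := y]          # := modifies by reference
add_copy <- function(DT) { DT$w <- DT$y; DT }  # $<- works on a local copy

dt1 <- data.table(x = 1:3, y = 4:6)
add_ref(dt1)
print("z" %in% names(dt1))   # TRUE: the caller's table gained a column

dt2 <- data.table(x = 1:3, y = 4:6)
out <- add_copy(dt2)
print("w" %in% names(dt2))   # FALSE: only the returned copy has it
```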
I am scoring a psychometric instrument at work and want to recode a few variables. Basically, each question has five possible responses, worth 0 to 4 respectively. That is how they were coded into our database, so I don't need to do anything except sum those. However, there are three questions that have reversed scores (so, when someone answers 0, we score that as 4). Thus, I am "reversing" those ones.
The data frame basically looks like this:
studyid timepoint date inst_q01 inst_q02 ... inst_q20
1 2 1995-03-13 0 2 ... 4
2 2 1995-06-15 1 3 ... 4
Here's what I've done so far.
# Survey Processing
# Find missing values (-9) and confusions (-1), and sum them
project_f03$inst_nmiss <- rowSums(project_f03[,4:23]==-9)
project_f03$inst_nconfuse <- rowSums(project_f03[,4:23]==-1)
project_f03$inst_nmisstot <- project_f03$inst_nmiss + project_f03$inst_nconfuse
# Recode any missing values into NAs
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
rm(x)
Everything so far is pretty fine; now I am about to recode the three reversed ones. My initial thought was to do a simple loop through the three variables with a series of assignment statements, something like below:
# Questions 3, 11, and 16 are reversed
for(x in c(3,11,16)+3) {
project_f03[project_f03[,x]==4,x] <- 5
project_f03[project_f03[,x]==3,x] <- 6
project_f03[project_f03[,x]==2,x] <- 7
project_f03[project_f03[,x]==1,x] <- 8
project_f03[project_f03[,x]==0,x] <- 9
project_f03[,x] <- project_f03[,x]-5
}
rm(x)
So, the five assignment statements just reassign new values, and the loop just takes it through all three of the variables in question. Since I was reversing the scale, I thought it was easiest to offset everything by 5 and then just subtract five after all recodes were done. The main issue, though, is that there are NAs and those NAs result in errors in the loop (naturally, NA==4 returns an NA in R). Duh - forgot a basic rule!
I've come up with three alternatives, but I'm not sure which is the best.
First, I could obviously just move the NA-creating code after the loop, and it should work fine. Pros: easiest to implement. Cons: Only works if I am receiving data with no innate (versus created) NAs.
Second, I could change the logic statement to be something like:
project_f03[!is.na(project_f03[,x]) & project_f03[,x]==4,x] which eliminates the logic conflict (note the single &, since this is a vectorized comparison). Pros: not too hard, I know it works. Cons: a lot of extra code; seems like a kludge.
Finally, I could change the logic from
project_f03[project_f03[,x]==4,x] <- 5 to
project_f03[project_f03[,x] %in% 4,x] <- 5. This seems to work fine, but I'm not sure if it's good practice, and wanted to get thoughts. Pros: quick fix for this issue and seems to work; preserves the general syntactic flow of "blah blah LOGIC blah <- bleh". Cons: might create a black hole? Not sure what the potential implications of using %in% like this might be.
EDITED TO MAKE CLEAR
This question has one primary component: Is it safe to utilize %in% as described in the third point above when doing logical operations, or are there reasons not to do so?
The second component is: What are recommended ways of reversing the values, like some have described in answers and comments?
The straightforward answer is that there is no black hole to using %in%. But in instances where I want to just discard the NA values, I'd use which: project_f03[which(project_f03[,x]==4),x] <- 5
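The underlying reason %in% is safe here: == propagates NA into the logical index, while %in% returns FALSE for NA, giving a clean index:

```r
x <- c(4, NA, 2)
print(x == 4)    # [1]  TRUE    NA FALSE  -> the NA breaks [<- subsetting
print(x %in% 4)  # [1]  TRUE FALSE FALSE  -> safe to use as an index
```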
%in% could shorten that earlier bit of code you had:
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
#could be
for(x in 4:23) {project_f03[project_f03[,x] %in% c(-9,-1), x] <- NA}
Like @flodel suggested, you can replace that whole block of code in your for-loop with project_f03[,x] <- rev(0:4)[match(project_f03[,x], 0:4, nomatch=10)]. It should preserve NA. And there are probably more opportunities to simplify code.
It doesn't answer your question, but should fix your problem:
cols <- c(3,11,16)+3
project_f03[, cols] <- abs(project_f03[, cols]-4)
## or a lot easier (as @TylerRinker suggested):
project_f03[, cols] <- max(project_f03[, cols], na.rm=TRUE) - project_f03[, cols]
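Both one-liners reverse a 0-4 scale while preserving NA, as a toy vector shows:

```r
x <- c(0, 1, NA, 3, 4)
print(4 - x)                     # [1]  4  3 NA  1  0
print(max(x, na.rm = TRUE) - x)  # same result when the observed maximum is 4
```

Note the na.rm = TRUE: without it, a single NA makes max() return NA and wipes out the whole column.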