I am trying to remove useless columns from a data frame. I used a while loop with an if statement, and it seems that it never leaves the if branch:
it = 1
while (it < ncol(testing))
{
if ("drop" %in% CategOfData[it,])
{testing[,it]<-NULL}
else it = it + 1
}
The if statement works as long as it's not nested in a while loop.
testing is my data frame, containing 400 rows and 12 columns.
CategOfData is a data frame of 12 rows and 2 columns. It contains the headers of my data frame testing and their categories; 3 rows contain the word "drop".
I tested this code by replacing {testing[,it]<-NULL} with {jkl = jkl + 0.5}.
Again the code ran for a long time. I cut it short and asked the console for the value of jkl, and it returned a number well over 800 000, while it should have returned 2.5 (1 + 3*0.5).
I don't understand why it never enters the else part of the code, which makes the while loop infinite since "it" never increments.
I would use a for loop, but R doesn't agree since I'm dropping columns as I go.
The structure of CategOfData:
> CategOfData [1,]
header x
"PIERRE.MARIE" "drop"
and of testing:
> head(testing[,1])
[1] PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE
Levels: LAURENNE PIERRE-MARIE
Can you help me pinpoint where the problem lies please?
I tried this instead
it = 1
rem = ncol(testing)
while (it < rem)
{
if ("drop" %in% CategOfData2[it,])
{testing[,it]<-NULL
CategOfData2 = CategOfData2[-it,]
rem = rem - 1
}
it = it + 1
}
It works more or less, but it doesn't remove the last column, which is a "drop" column.
Your problem can be solved with two lines of R code:
drop <- apply(CategOfData, 1, function(x) { "drop" %in% x })
testing <- testing[, !drop]
And this is a good example of the kind of power which R has, when you use it correctly.
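To make this concrete, here is a minimal sketch on made-up data. The toy testing and CategOfData below are assumptions (4 columns instead of 12), not the asker's real data. The point is that the set of columns to drop is computed once, up front, so nothing shifts position while you are still indexing:

```r
# Toy stand-ins for the asker's objects (hypothetical values)
testing <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
CategOfData <- data.frame(header = names(testing),
                          x = c("keep", "drop", "keep", "drop"),
                          stringsAsFactors = FALSE)

# One logical flag per row of CategOfData, i.e. per column of testing
drop <- apply(CategOfData, 1, function(x) "drop" %in% x)

# Keep only the columns that are not flagged; drop = FALSE keeps a data frame
# even if a single column survives
kept <- testing[, !drop, drop = FALSE]
names(kept)
```

If the category always sits in the x column, CategOfData$x == "drop" should give the same flags without apply.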
When working with R I frequently get the error message "subscript out of bounds". For example:
# Load necessary libraries and data
library(igraph)
library(NetData)
data(kracknets, package = "NetData")
# Reduce dataset to nonzero edges
krack_full_nonzero_edges <- subset(krack_full_data_frame, (advice_tie > 0 | friendship_tie > 0 | reports_to_tie > 0))
# Convert to graph data frame
krack_full <- graph.data.frame(krack_full_nonzero_edges)
# Set vertex attributes
for (i in V(krack_full)) {
for (j in names(attributes)) {
krack_full <- set.vertex.attribute(krack_full, j, index=i, attributes[i+1,j])
}
}
# Calculate reachability for each vertex
reachability <- function(g, m) {
reach_mat = matrix(nrow = vcount(g),
ncol = vcount(g))
for (i in 1:vcount(g)) {
reach_mat[i,] = 0
this_node_reach <- subcomponent(g, (i - 1), mode = m)
for (j in 1:(length(this_node_reach))) {
alter = this_node_reach[j] + 1
reach_mat[i, alter] = 1
}
}
return(reach_mat)
}
reach_full_in <- reachability(krack_full, 'in')
reach_full_in
This generates the following error: Error in reach_mat[i, alter] = 1 : subscript out of bounds.
However, my question is not about this particular piece of code (even though it would be helpful to solve that too), but my question is more general:
What is the definition of a subscript-out-of-bounds error? What causes it?
Are there any generic ways of approaching this kind of error?
This happens because you are trying to access an array outside its bounds.
I will show you how you can debug such errors.
I set options(error=recover)
I run reach_full_in <- reachability(krack_full, 'in')
I get :
reach_full_in <- reachability(krack_full, 'in')
Error in reach_mat[i, alter] = 1 : subscript out of bounds
Enter a frame number, or 0 to exit
1: reachability(krack_full, "in")
I enter 1 and I get
Called from: top level
I type ls() to see my current variables
[1] "*tmp*" "alter" "g" "i" "j" "m" "reach_mat" "this_node_reach"
Now, I will see the dimensions of my variables :
Browse[1]> i
[1] 1
Browse[1]> j
[1] 21
Browse[1]> alter
[1] 22
Browse[1]> dim(reach_mat)
[1] 21 21
You can see that alter is out of bounds (22 > 21) in the line:
reach_mat[i, alter] = 1
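To avoid such errors, I personally do the following:
Try to use the apply family of functions (lapply, sapply, ...). They are safer than for loops.
Use seq_along(x) rather than 1:n, because 1:n becomes 1:0 (that is, c(1, 0)) when n is 0.
Try to think of a vectorized solution if you can, to avoid mat[i,j] index access.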
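As a rough illustration of these three habits, here is a sketch on a made-up structure. The reach list and n are assumptions for the example, not part of the igraph code above. It fills the same 0/1 matrix three ways: with guarded index loops, with whole-row assignment, and with vapply:

```r
# Hypothetical input: reach[[i]] holds the 1-based columns to flag in row i
reach <- list(c(1, 2), integer(0), c(2, 3))
n <- 3

# 1. Index loops guarded by seq_along(): the empty element is simply skipped
reach_loop <- matrix(0, nrow = n, ncol = n)
for (i in seq_along(reach)) {
  for (j in seq_along(reach[[i]])) {
    reach_loop[i, reach[[i]][j]] <- 1
  }
}

# 2. Vectorized assignment: a whole set of columns in one step, no inner loop
reach_vec <- matrix(0, nrow = n, ncol = n)
for (i in seq_along(reach)) {
  reach_vec[i, reach[[i]]] <- 1
}

# 3. apply-style: build each row as a 0/1 vector, then transpose,
#    because vapply returns one column per list element
reach_apply <- t(vapply(reach,
                        function(idx) as.numeric(seq_len(n) %in% idx),
                        numeric(n)))
```

All three should produce the same matrix; the second and third never compute an individual [i, j] position by hand, which is where off-by-one mistakes usually creep in.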
EDIT vectorize the solution
For example, here I see that you don't use the fact that set.vertex.attribute is vectorized.
You can replace:
# Set vertex attributes
for (i in V(krack_full)) {
for (j in names(attributes)) {
krack_full <- set.vertex.attribute(krack_full, j, index=i, attributes[i+1,j])
}
}
by this:
## set.vertex.attribute is vectorized!
## no need to loop over vertex!
for (attr in names(attributes))
  krack_full <- set.vertex.attribute(krack_full,
                                     attr, value = attributes[,attr])
It just means that either alter > ncol( reach_mat ) or i > nrow( reach_mat ), in other words, your indices exceed the array boundary (i is greater than the number of rows, or alter is greater than the number of columns).
Just check these conditions to see what is happening, and when.
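Those checks can be written out directly. The sketch below reuses the values seen in the debugging session above (a 21 x 21 matrix, i = 1, alter = 22); the variable names in_bounds and msg are just for illustration:

```r
reach_mat <- matrix(0, nrow = 21, ncol = 21)
i <- 1
alter <- 22

# The two conditions from the answer above, written as one explicit check
in_bounds <- i >= 1 && i <= nrow(reach_mat) &&
  alter >= 1 && alter <= ncol(reach_mat)

# Attempt the assignment anyway and capture the error message instead of stopping
msg <- tryCatch({
  reach_mat[i, alter] <- 1
  "no error"
}, error = function(e) conditionMessage(e))

in_bounds
msg
```

Printing in_bounds just before the failing line (or calling stopifnot(in_bounds)) tells you which index went wrong before R raises the less specific error.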
Only an addition to the above responses: A possibility in such cases is that you are calling an object, that for some reason is not available to your query. For example you may subset by row names or column names, and you will receive this error message when your requested row or column is not part of the data matrix or data frame anymore.
Solution: As a short version of the responses above: you need to find the last working row name or column name, and the next called object should be the one that could not be found.
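For the name-based case, a small sketch with a made-up matrix (m and wanted are hypothetical) shows both the error and one way to find the offending name before indexing:

```r
# A matrix with row and column names
m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "b"), c("x", "y")))

# Row names we would like to extract; "z" does not exist in m
wanted <- c("a", "b", "z")

# Indexing a matrix by a missing row name raises the error
msg <- tryCatch(m["z", ], error = function(e) conditionMessage(e))

# Find the names that are not there, and subset only by those that are
missing_rows <- setdiff(wanted, rownames(m))
safe <- m[intersect(wanted, rownames(m)), , drop = FALSE]
```

Here missing_rows names the culprit directly, so there is no need to hunt for the last working name by hand.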
If you run parallel code, for example with foreach, you need to convert it to a plain for loop to be able to troubleshoot it.
If this helps anybody, I encountered this while using purrr::map() with a function I wrote, which was something like this:
find_nearby_shops <- function(base_account) {
states_table %>%
filter(state == base_account$state) %>%
left_join(target_locations, by = c('border_states' = 'state')) %>%
mutate(x_latitude = base_account$latitude,
x_longitude = base_account$longitude) %>%
mutate(dist_miles = geosphere::distHaversine(p1 = cbind(longitude, latitude),
p2 = cbind(x_longitude, x_latitude))/1609.344)
}
nearby_shop_numbers <- base_locations %>%
split(f = base_locations$id) %>%
purrr::map_df(find_nearby_shops)
I would get this error sometimes with samples, but most times I wouldn't. The root of the problem is that some of the states in the base_locations table (PR) did not exist in the states_table, so essentially I had filtered out everything, and passed an empty table on to mutate. The moral of the story is that you may have a data issue and not (just) a code problem (so you may need to clean your data.)
Thanks to agstudy and zx8754 for their answers above, which helped with the debugging.
I sometimes encounter the same issue. I can only answer your second bullet, because I am not as expert in R as I am with other languages. I have found that the standard for loop has some unexpected results. Say x = 0
for (i in 1:x) {
print(i)
}
The output is
[1] 1
[1] 0
Whereas with Python, for example
for i in range(x):
    print(i)
does nothing. The loop is not entered.
I expected that if x = 0 that in R, the loop would not be entered. However, 1:0 is a valid range of numbers. I have not yet found a good workaround besides having an if statement wrapping the for loop
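One workaround that avoids the wrapping if statement is seq_len(), which returns an empty sequence for 0. A minimal sketch (visited is only there to record which iterations ran):

```r
x <- 0

# 1:x counts down from 1 to 0, so it has two elements
backwards <- 1:x

# seq_len(x) is empty when x is 0, so the loop body never runs
visited <- integer(0)
for (i in seq_len(x)) {
  visited <- c(visited, i)
}

length(backwards)
length(visited)
```

For looping over the elements of an existing vector, seq_along(vec) plays the same role.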
This comes from Stanford's free SNA tutorial, which states:
# Reachability can only be computed on one vertex at a time. To
# get graph-wide statistics, change the value of "vertex"
# manually or write a for loop. (Remember that, unlike R objects,
# igraph objects are numbered from 0.)
OK, so whenever you use igraph, the first row/column is 0 rather than 1, but a matrix starts at 1. Thus, for any calculation on igraph objects you need x-1, as shown at
this_node_reach <- subcomponent(g, (i - 1), mode = m)
but for the alter calculation, there is a typo here
alter = this_node_reach[j] + 1
Delete the +1 and it will work all right.
What did it for me was going back in the code and check for errors or uncertain changes and focus on need-to-have over nice-to-have.
I am trying to print the "result" of using table function, but when I tried to use the code here, I got something very strange:
for (i in 1:4){
print (table(paste("group",i,"$", "BMI_obese",sep=""), paste("group",i,"$","A1.1", sep="")))
}
This is the result in R output:
group1$A1.1
group1$BMI_obese 1
group2$A1.1
group2$BMI_obese 1
group3$A1.1
group3$BMI_obese 1
group4$A1.1
group4$BMI_obese 1
But when I type out the statement without typing inside the loop:
table(group2$BMI_obese, group2$A1.1)
I got what I want:
1 2 3 4 5
0 51 20 9 8 0
1 37 20 15 6 4
Does anyone know which part of my for loop code is not correct or can be modified to fit my purpose of printing the loop table result?
Hi all, now I have another problem. I am trying to add an inner loop that takes the column name as an argument, because I would like to loop through multiple columns for each of the group data frames (i.e. for group1, I would like to have tables of BMI_obese vs A1.1, BMI_obese vs A1.2, ..., BMI_obese vs A1.15). This is my code, but somehow it is not working. I think it is because A1.1, A1.2, ... are not recognized as columns of group1, group2, group3, group4, but are treated as strings instead. I am not sure how to fix it:
for (i in 2:4) {
for (j in c("A1.1","A1.2"))
{
print(with(get(paste0("group", i)),table(BMI_obese,j)))
}
}
I keep getting this error message:
Error in table(BMI_obese, j) : all arguments must have the same length
Okay, you are trying to construct a variable name using paste and then do a table. You are simply passing the name of the variable to table, not the variable object itself. For this sort of approach you want to use get()
for (i in 1:4) {
  print(with(get(paste0("group", i)), table(BMI_obese, A1.1)))
}
#example saving as a list (using lapply rather than for loop)
group1 <- data.frame(x=LETTERS[1:10], y=(1:10)[sample(10, replace=TRUE)])
group2 <- data.frame(x=LETTERS[1:10], y=(1:10)[sample(10, replace=TRUE)])
result <- lapply(1:2, function(i) with(get(paste0("group", i)), table(x, y)))
#look at first six rows of each:
head(result[[1]])
head(result[[2]])
#example illustrating fetching objects from a string name
data(mtcars)
head(with(get("mtcars"), table(disp, cyl)))
head(with(get("mtcars"), table(disp, "cyl")))
#Error in table(disp, "cyl") : all arguments must have the same length
head(with(get("mtcars"), table(disp, get("cyl"))))
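For the follow-up with the inner loop over column names, one option is to index the column by its name with [[ ]] instead of handing the string j to table. This is a sketch on made-up data: the two small group data frames and the results list are assumptions for the example.

```r
# Hypothetical stand-ins for the asker's group data frames
group1 <- data.frame(BMI_obese = c(0, 1, 0, 1),
                     A1.1 = c(1, 2, 2, 1),
                     A1.2 = c(3, 3, 4, 4))
group2 <- data.frame(BMI_obese = c(1, 1, 0, 0),
                     A1.1 = c(1, 1, 2, 2),
                     A1.2 = c(3, 4, 3, 4))

results <- list()
for (i in 1:2) {
  g <- get(paste0("group", i))     # fetch the data frame by its name
  for (j in c("A1.1", "A1.2")) {
    # g[[j]] is the column named by the string j, not the string itself
    results[[paste0("group", i, "_", j)]] <- table(g$BMI_obese, g[[j]])
  }
}

names(results)
```

Storing the tables in a list rather than only printing them makes it easier to inspect one afterwards, e.g. results[["group1_A1.2"]].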
You could also use a combination of eval and parse like this:
x1 <- c(sample(10, 100, replace = TRUE))
y1 <- c(sample(10, 100, replace = TRUE))
table(eval(parse(text = paste0("x", 1))),
eval(parse(text = paste0("y", 1))))
But I'd also say it is not the nicest practice to access variables that way...
Your types are used wrong. See the difference:
table(group2$BMI_obese, group2$A1.1)
and
table(paste(...),paste(...))
So what type does paste return? Certainly some string.
EDIT:
paste(...) was not meant to be syntactically correct but an abbreviation for paste("group",i,"$", "BMI_obese",sep=""), or whatever you paste together.
paste(...) returns a string. If you put that result into table, you get a table of strings (the unexpected result that you got). What you want to do is access the variables or fields whose names are returned by your paste(...). Just add an eval(parse(text = ...)) around your paste, like Daniel said, and do it like this:
for (i in 1:4){
print (table(eval(parse(text = paste("group",i,"$", "BMI_obese",sep=""))),eval(parse(text = paste("group",i,"$","A1.1", sep="")))))
}
I am trying to understand for loops and if statements in R, so I ran some code that says: if the sum of a row is bigger than 3, return 1, else zero:
Here is the code
set.seed(2)
x = rnorm(20)
y = 2*x
a = cbind(x,y)
hold = c()
Now comes the if-statement
for (i in nrow(a)) {
if ([i,1]+ [i,2] > 3) hold[i,] == 1
else ([i,1]+ [i,2]) <- hold[i,] == 0
return (cbind(a,hold)
}
I know that maybe combining for and if may not be ideal, but I just want to understand what is going wrong. Please keep the explanation at a dummy level:) Thanks
You've got some issues. @mnel covered a better way to go about doing this, so I'll focus on understanding what went wrong in this attempt (but don't do it this way at all, use a vectorized solution).
Line 1
for (i in nrow(a)) {
a has 20 rows. nrow(a) is 20. Thus your code is equivalent to for (i in 20), which means i will only ever be 20.
Fix:
for (i in 1:nrow(a)) {
Line 2
if ([i,1]+ [i,2] > 3) hold[i,] == 1
[i,1] isn't anything, it's the ith row and first column of... nothing. You need to reference your data: a[i,1]
You initialized hold as a vector, c(), so it only has one dimension, not rows and columns. So we want to assign to hold[i], not hold[i,].
== is used for equality testing. = or <- are for assignment. Right now, if the >3 condition is met, then you check if hold[i,] is equal to 1. (And do nothing with the result).
Fix:
if (a[i,1]+ a[i,2] > 3) hold[i] <- 1
Line 3
else ([i,1]+ [i,2]) <- hold[i,] == 0
As above for assignment vs equality testing. (Here you used an arrow assignment, but put it in the wrong place - as if you're trying to assign to the else)
else happens whenever the if condition isn't met, you don't need to try to repeat the condition
Fix:
else hold[i] <- 0
Fixed code together:
for (i in 1:nrow(a)) {
if (a[i,1] + a[i,2] > 3) hold[i] <- 1
else hold[i] <- 0
}
You aren't using curly braces for your if and else expressions. They are not required for single-line expressions (if something do this one line). They are required for multi-line expressions (if something do a bunch of stuff), but I think they're a good idea to use in either case. Also, in R, it's good practice to put the else on the same line as the } from the preceding if (inside a for loop or a function it doesn't matter, but elsewhere it does, so it's good to get in the habit of always doing it). I would recommend this reformatted code:
for (i in 1:nrow(a)) {
if (a[i, 1] + a[i, 2] > 3) {
hold[i] <- 1
} else {
hold[i] <- 0
}
}
Using ifelse
ifelse() is a vectorized if-else statement in R. It is appropriate when you want to test a vector of conditions and get a result out for each one. In this case you could use it like this:
hold <- ifelse(a[, 1] + a[, 2] > 3, 1, 0)
ifelse will take care of the looping for you. If you want it as a column in your data, bind it on directly (no need to initialize first). Since a is a matrix built with cbind, use cbind rather than $:
a <- cbind(a, hold = ifelse(a[, 1] + a[, 2] > 3, 1, 0))
Such operations in R are nicely vectorised.
You haven't included a reference to the dataset you wish to index with your call to [ (eg a[i,1])
using rowSums
h <- rowSums(a) > 3
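Note that h here is a logical vector (TRUE/FALSE). If you want the 1/0 values from the question, you can convert it. A small sketch on a fixed toy matrix (the values are made up so the result is easy to check by hand) compares the loop with the vectorised version:

```r
# Toy matrix with known row sums: 1.5, 3.6, 6.0
a <- cbind(x = c(0.5, 1.2, 2.0), y = c(1.0, 2.4, 4.0))

# Loop version, with hold pre-allocated to the right length
hold_loop <- numeric(nrow(a))
for (i in seq_len(nrow(a))) {
  hold_loop[i] <- if (sum(a[i, ]) > 3) 1 else 0
}

# Vectorised version: one comparison on all row sums, then logical -> numeric
hold_vec <- as.numeric(rowSums(a) > 3)

# Attach the result as a new column of the matrix
a_with_hold <- cbind(a, hold = hold_vec)
```

Both hold_loop and hold_vec should come out as 0 1 1 for this matrix.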
I am going to assume that you are new to R and trying to learn about the basic function of the for loop itself. R has fancy functions called "apply" functions that are specifically for doing basic math on each row of a data frame. I am not going to talk about these.
You want to do the following on each row of the array.
Sum the elements of the row.
Test that the sum is greater than 3.
Return a value of 1 or 0 representing the result of 2.
For 1, luckily "sum" is a built in function. It pays off to check out the built in functions within every programming language because they save you time. To sum the elements of a row, just use sum(a[row_number,]).
For 2, you are evaluating a logical statement "is x >3?" where x is the result from 1. The ">3" statement returns a value of true or false. The logical expression is a fancy "if then" statement without the "if then".
> 4>3
[1] TRUE
> 2>3
[1] FALSE
For 3, a true or false value is a data structure called a "logical" value in R. A 1 or 0 value is a data structure called a "numeric" value in R. By converting the "logical" into a "numeric", you can change the TRUE to 1's and FALSE to 0's.
> class(4>3)
[1] "logical"
> as.numeric(4>3)
[1] 1
> class(as.numeric(4>3))
[1] "numeric"
A for loop has a min, a max, a counter, and an executable. The counter starts at the min, and increments until it goes to the max. The executable will run for each run of the counter. You are starting at the first row and going to the last row. Putting all the elements together looks like this.
for (i in 1:nrow(a)){
hold[i] <- as.numeric(sum(a[i,])>3)
}
FYI, I'm new to using R so my code is likely quite clunky. I've done my homework on this but haven't been able to find an "Except" logical operator for R and really need something like that in my code. My input data is a .csv containing integers and null values with 12 columns and 1440 rows.
oneDayData <- read.csv("data.csv") # Loading data
oneDayMatrix <- data.matrix(oneDayData, rownames.force = NA) #turning data frame into a matrix
rowBefore <- data.frame(oneDayData[i-1,10], stringsAsFactors=FALSE) # Creating a variable to be used in the if statement, represents cell before the cell in the loop
ctr <- 0 # creating a counter and zeroing it
for (i in 1:nrow(oneDayMatrix)) {
if ((oneDayMatrix[i,10] == -180) & (oneDayMatrix[i,4] == 0)) { # Makes sure that there is missing data matched with a zero in activityIn
impute1 <- replace(oneDayMatrix[ ,10], oneDayMatrix[i,10], rowBefore)
ctr <- (ctr + 1) # Populating the counter with how many rows get changed
}
else{
print("No data fit this criteria.")
}
}
print(paste(ctr, "rows have been changed.")) # Printing the counter and number of rows that got changed
I would like to add some kind of EXCEPT condition to my if statement or equivalent that says something like: employ the two previous conditions (see if statement in code) EXCEPT when oneDayMatrix[i-1, 4] > 0. I would really appreciate any help with this and thank you in advance!
"Except" is equivalent to "if not". The "not" operator in R is !. So to add that oneDayMatrix[i-1, 4] > 0 exception, you just need to modify your if statement as follows:
if ((oneDayMatrix[i, 10] == -180) &
(oneDayMatrix[i, 4] == 0) &
!(oneDayMatrix[i-1, 4] > 0)) { ... }
or equivalently:
if ((oneDayMatrix[i, 10] == -180) &
(oneDayMatrix[i, 4] == 0) &
(oneDayMatrix[i-1, 4] <= 0)) { ... }
This goes on top of a couple fixes that need to be made to your code:
As I pointed out, rowBefore is not defined properly: it is defined in terms of i, which does not exist yet at that point. Inside your for loop, just replace rowBefore with oneDayMatrix[i-1, 10].
As @noah pointed out, you need to start your loop at the second index: for (i in 2:nrow(oneDayMatrix)).
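Putting these fixes together might look roughly like the sketch below. It makes several assumptions: the 5-row toy matrix stands in for the real 1440-row data, the intent of the replace() call is taken to be "copy the previous row's value into column 10", and the scalar && is used in place of & since each condition is a single value. It also assumes there are no NA values in the columns being tested; with NA, the if condition would fail and would need an is.na() check first.

```r
# Toy stand-in for the real data (hypothetical values)
oneDayMatrix <- matrix(0, nrow = 5, ncol = 12)
oneDayMatrix[, 10] <- c(7, -180, 8, -180, -180)
oneDayMatrix[, 4]  <- c(3, 0, 0, 0, 0)

ctr <- 0
for (i in 2:nrow(oneDayMatrix)) {            # start at 2 so that row i - 1 exists
  if (oneDayMatrix[i, 10] == -180 &&         # missing-data marker
      oneDayMatrix[i, 4] == 0 &&             # matched with a zero
      !(oneDayMatrix[i - 1, 4] > 0)) {       # EXCEPT when the previous row is > 0
    oneDayMatrix[i, 10] <- oneDayMatrix[i - 1, 10]   # assumed intent: carry the value forward
    ctr <- ctr + 1
  }
}

print(paste(ctr, "rows have been changed."))
```

In this toy data, row 2 is left alone because the previous row has a positive value in column 4, while rows 4 and 5 are filled from the row before.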
I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
This is the very inefficient for loop I have below. Notice that there are many observations in X and each member of "match.ind" can have zero or more matches. Also, dat.fram has over 1 million observations. Is there any way to use a vector function in R to make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .
Code:
match.ind=list()
for(i in 1:150000){
match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, rep=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]
And that's it! As a check, let's look at the first 3 entries of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list, but this is presumably what you want anyway as they're NULL!
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list with all the additional list stuff you need (the names are the values of X not featuring in names(mylist), and the actual entries in the list are simply NA). The final two lines firstly merge mylist and mylist_extras, and then perform a reordering so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
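A related variation, offered only as a sketch: split() groups the indices in the same way as the tapply call, and indexing the result by as.character(X) returns one entry per element of X, in the order of X. The tiny v and X below are made up so the result can be checked against the original which() loop; whether this is faster on the real data is something you would need to time yourself.

```r
# Small hypothetical data: v plays the role of dat.fram[, 3]
v <- c(3, 1, 3, 2, 1)
X <- c(1, 3, 9)          # 9 does not occur in v

# Group the positions of v by value, in one pass
idx <- split(seq_along(v), v)

# Pick out the groups for X; values of X missing from v come back as NULL
match.ind <- idx[as.character(X)]

# Replace the NULL entries with integer(0) so every element of X has an entry
match.ind[lengths(match.ind) == 0] <- list(integer(0))
names(match.ind) <- X

# Reference result from the original loop-style approach
reference <- lapply(X, function(x) which(v == x))
```

Here match.ind has the same contents as the list built by the for loop in the question, with an empty vector for values that have no match.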
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,rep=TRUE), b = sample(1:10,n,rep=TRUE),
c = sample(1:10,n,rep=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})