How do I delete all rows based on a loop in R

I am writing a for loop to delete the rows in which all of the values in columns 5 through 8 are NA. However, it only deletes SOME of the rows. When I use a while loop instead, it deletes all of the rows, but I have to end it manually (i.e. it is an infinite loop... I also have no idea why)
The for/if loop:
for(i in 1:nrow(df)){
  if(is.na(df[i,5]) && is.na(df[i,6]) &&
     is.na(df[i,7]) && is.na(df[i,8])){
    df <- df[-i,]
  }
}
while loop (but it is infinite):
for(i in 1:nrow(df)){
  while(is.na(df[i,5]) && is.na(df[i,6]) &&
        is.na(df[i,7]) && is.na(df[i,8])){
    df <- df[-i,]
  }
}
Can someone help? Thanks!

What's happening here is that when you remove a row in this way, all the rows below it "move up" to fill the space left behind. When two consecutive rows should both be deleted, the second one gets skipped over. Imagine this table:
1 keep
2 delete
3 delete
4 keep
Now, you loop through a sequence from 1 to 4 (the number of rows) deleting rows that say delete:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 3, the 3rd row says keep, so keep it ... The final table is:
1 keep
2 delete
3 keep
In your example with while, however, the deletion step keeps running on row 2 until that row doesn't meet the conditions instead of moving on to i = 3 right away. So the process goes:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 2 (again), delete that row (again). Now, the data frame looks like this:
1 keep
2 keep
i = 2 (again), this row says keep, so keep it and move on to i = 3. (This is also where the infinite loop comes from: once i exceeds the number of rows that remain, df[i, 5] through df[i, 8] are all NA, so the while condition stays TRUE, yet df[-i, ] removes nothing, and the loop never exits.)
I'd be remiss to answer this question without mentioning that there are much better ways to do this in R such as square bracket notation (enter ?`[` in the R console), the filter function in the dplyr package, or the data.table package.
This question has many options: Filter data.frame rows by a logical condition
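For reference, here is a minimal loop-free sketch of the square-bracket approach mentioned above; it assumes, as in the question, that the data frame is called df and that the NA check covers columns 5 through 8.
# Count the NAs in columns 5-8 of each row; a row qualifies for removal when all four are NA.
all_na <- rowSums(is.na(df[, 5:8])) == 4
df <- df[!all_na, ]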

Store the row number in a vector and remove outside the loop.
test <- iris
test[1:5,2:4] <- NA
> head(test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA setosa
2 4.9 NA NA NA setosa
3 4.7 NA NA NA setosa
4 4.6 NA NA NA setosa
5 5.0 NA NA NA setosa
6 5.4 3.9 1.7 0.4 setosa
x <- 0                   # start with a dummy 0; it is ignored by the negative indexing below
for(i in 1:nrow(test)){
  if(is.na(test[i,2]) && is.na(test[i,3]) &&
     is.na(test[i,4])){
    x <- c(x,i)          # collect the row numbers to drop
  }
}
x
test <- test[-x,]        # remove all collected rows in one step, outside the loop
head(test)
> head(test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
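For comparison, the same rows can be dropped without a loop. A sketch on the iris-based example above (not part of the original answer; it assumes dplyr >= 1.0.4, where if_all() is available, and that Sepal.Width:Petal.Width covers the columns checked for NA):
library(dplyr)
test <- iris
test[1:5, 2:4] <- NA
# if_all() is TRUE only when every selected column is NA in that row; negate it to keep the rest.
test_clean <- test %>% filter(!if_all(Sepal.Width:Petal.Width, is.na))
head(test_clean)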

Related

How to select variables with star symbols in R

I want to select some variables from my csv file in R. I used this select(gender*, age*), but got the error - object not found. I tried select(`gender*`, `age*`) and select(starts_with(gender), starts_with(age)), but neither works. Does anyone know how to select variables with star symbols? Thanks a lot!
It is possible that dplyr's select is being masked by a select from another package, because the following works fine. Either qualify the call with the package name (dplyr::select) or try it in a fresh R session with only dplyr loaded:
library(dplyr)
data(iris)
iris$'gender*' <- 'M'
iris %>%
  head %>%
  dplyr::select(`gender*`)
# gender*
#1 M
#2 M
#3 M
#4 M
#5 M
#6 M
To select a list of column names starting with a specific string, one can use the starts_with() function in dplyr. To illustrate, we'll select the two columns that start with the string Sepal, as in Sepal.Length and Sepal.Width.
library(dplyr)
select(iris,starts_with("Sepal")) %>% head()
...and the output:
> select(iris,starts_with("Sepal")) %>% head()
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
>
We can do the same thing in Base R with grepl() and a regular expression.
# base R version
head(iris[,grepl("^Sepal",names(iris))])
...and the output:
> head(iris[,grepl("^Sepal",names(iris))])
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
>
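Coming back to the star-symbol columns from the question: starts_with() handles them as well, provided the prefix is passed as a quoted string (the unquoted starts_with(gender) in the question most likely failed because R looked for an object named gender). A small sketch, reusing the gender* column created in the first answer:
library(dplyr)
iris$'gender*' <- 'M'
# The quoted prefix matches the literal column name "gender*".
iris %>% dplyr::select(starts_with("gender")) %>% head()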
Also note that if one is using read.csv() to create a data frame in R, it converts any occurrences of * in column headings to ..
# confirm that * is converted to . in read.csv()
textFile <- 'v*1,v*2
1,2
3,4
5,6'
data <- read.csv(text = textFile,header = TRUE)
# see how illegal column name * is converted to .
names(data)
...and the output:
> names(data)
[1] "v.1" "v.2"
>
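If the original names should survive the import, read.csv() also accepts check.names = FALSE (an argument of read.table()); the columns then keep their literal names, star included, and can be selected with backticks as shown above.
data2 <- read.csv(text = textFile, header = TRUE, check.names = FALSE)
names(data2)
# [1] "v*1" "v*2"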

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame as following:
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 NA
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
My goal is simple to state but has turned out to be a bit tricky; surely it can be solved in several ways:
I want to apply a function "func" to the rows of the data frame, grouped by a factor, e.g. the factor "Lot". This is done via
m_dist_lot <- by(data.frame, data.frame$Lot, func)
This actually works but the result is a by-list:
data.frame$Lot: 7
354 355 363 367 378 419 426 427 428 431 460 477 836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
1 2 11 15 17 18 19 20 21 22 24 25
2.1415352 4.6459868 1.3485551 38.8218984 3.9988686 2.2473563 6.7186047 2.6433790 0.5869746 0.5832567 4.5321623 1.8567318
The first row seems to indicate the rows of the initial data.frame the data was taken from; the second row contains the calculated values.
My problem now is: How can I store these values properly into the origin data.frame according to the correct rows?
For example, for one particular calculation/row of the data frame,
m_dist_lot <- by(data.frame, data.frame$Lot, func)
yields for the second row of the data.frame
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 4.6459868
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
but I don't know how. My best attempt so far uses unlist().
un <- unlist(m_dist_lot) results in
un[1]
6.354
3.523125
un[2]
6.355
9.422959
un[3]
(..)
But I still don't know how to separate the "factor.row" information from the calculated value in such a way that the values are stored in the correct rows of the data frame.
At least when using un <- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
3.523125
un[2]
9.422959
un[3]
1.49965
(..)
But now I have lost the information needed to assign these values to the proper rows of the data.frame.
Using un <- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors = FALSE)) results in
(...)
7.922 0.94130936
7.976 4.89560441
8.1 2.14153516
8.2 4.64598677
8.11 1.34855514
(...)
Here I still lack a proper mapping between the calculated values and the rows of the data.frame.
I'm sure there must be a doable way. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the built-in iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
topIris <-
iris %>%
split(.$Species) %>%
lapply(head, n = 3) %>%
bind_rows()
This returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6.0 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3.0 5.9 2.1 virginica
I am going to use this data to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function which took any number of columns as arguments (which are then passed in as vectors) and returned a vector of the same length (to create a new column or update an existing one). The select function at the end is only there to make it easier to see what I did
topIris %>%
group_by(Species) %>%
mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
Species rank_Sepal_Length Sepal.Length
<fctr> <dbl> <dbl>
1 setosa 3 5.1
2 setosa 2 4.9
3 setosa 1 4.7
4 versicolor 3 7.0
5 versicolor 1 6.4
6 versicolor 2 6.9
7 virginica 2 6.3
8 virginica 1 5.8
9 virginica 3 7.1
I got a workaround with the help of "Force gsub to keep trailing zeros":
un <- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors = FALSE))
un <- gsub(".*.", "", un)
un <- regmatches(un, gregexpr("(?<=.).*", un, perl = TRUE))
rows <- data.frame(matrix(ncol = 1, nrow = lengths(un)))
colnames(rows) <- c("row_number")
rows["row_number"] <- sprintf("%s", rownames(un))
rows["row_number"] <- as.numeric(un[,1])
rows["row_number"] <- sub("^[^.]*[.]", "", format(rows[,1], width = max(nchar(rows[,1]))))

Is there a general inverse of the table() function?

I am aware that a little programming allows converting fixed-dimension frequency tables, as returned e.g. by table(), back into observation data. So the aim is to convert a frequency table such as this one...
(flower.freqs <- with(iris,table(Petal=cut(Petal.Width,2),Species)))
              Species
Petal          setosa versicolor virginica
  (0.0976,1.3]     50         28         0
  (1.3,2.5]         0         22        50
...back into a data.frame() whose number of rows equals the sum of the counts in the input table, and whose cell values are taken from the table's dimension labels:
Petal Species
1 (0.0976,1.3] setosa
2 (0.0976,1.3] setosa
3 (0.0976,1.3] setosa
# ... (150 rows) ...
With some tinkering I built a rough prototype that should also digest higher-dimensional inputs:
tableinv <- untable <- function(x) {
  stopifnot(is.table(x))
  obs <- as.data.frame(x)[rep(1:prod(dim(x)), c(x)), -length(dim(x)) - 1]
  rownames(obs) <- NULL
  obs
}
> head(tableinv(flower.freqs)); dim(tableinv(flower.freqs))
Petal Species
1 (0.0976,1.3] setosa
2 (0.0976,1.3] setosa
3 (0.0976,1.3] setosa
4 (0.0976,1.3] setosa
5 (0.0976,1.3] setosa
6 (0.0976,1.3] setosa
[1] 150 2
> head(tableinv(Titanic)); nrow(tableinv(Titanic))==sum(Titanic)
Class Sex Age Survived
1 3rd Male Child No
2 3rd Male Child No
3 3rd Male Child No
4 3rd Male Child No
5 3rd Male Child No
6 3rd Male Child No
[1] TRUE
I am obviously proud that this bricolage reconstructs multi-attribute data.frame()s from higher-dimensional frequency tables such as Titanic - but is there an established (built-in, battle-tested) general inverse to table(), ideally one that does not depend on a specific library, that knows how to handle unlabeled dimensions, that is optimized so that it will not choke on bulky inputs, and that reasonably deals with table inputs that would correspond to factor as well as non-factor observation inputs?
I believe that your solution is pretty good. In any case, the way I would address this question is quite similar:
tableinv <- function(x){
  y <- x[rep(rownames(x), x$Freq), 1:(ncol(x) - 1)]
  rownames(y) <- c(1:nrow(y))
  return(y)
}
survivors <- as.data.frame(Titanic)
surv.invtab <- tableinv(survivors)
which yields
> head(surv.invtab)
Class Sex Age Survived
1 3rd Male Child No
2 3rd Male Child No
3 3rd Male Child No
4 3rd Male Child No
5 3rd Male Child No
6 3rd Male Child No
Concerning the example with the flowers, using the function tableinv() as defined above, it would first be necessary to convert the data into a data frame:
flower.freqs <- with(iris,table(Petal=cut(Petal.Width,2),Species))
flower.freqs <- as.data.frame(flower.freqs)
flower.invtab <- tableinv(flower.freqs)
The result in this case is
> head(flower.invtab)
Petal Species
1 (0.0976,1.3] setosa
2 (0.0976,1.3] setosa
3 (0.0976,1.3] setosa
4 (0.0976,1.3] setosa
5 (0.0976,1.3] setosa
6 (0.0976,1.3] setosa
Hope this helps.
In the specific case of one-dimensional frequency data, there is an easy way. Let's take an example:
mytable = table(mtcars$cyl)
#### 4 6 8
#### 11 7 14
A simple function to retrieve expanded data:
InvTable = function(tb, random = TRUE){
  output = rep(names(tb), tb)
  if (random) { output <- base::sample(output, replace = FALSE) }
  return(output)
}
InvTable(mytable, T)
#### [1] "4" "8" "8" "4" "4" "6" "6" ...
This is not exactly what the original poster asked for, but I think it could be very helpful in many similar cases.
Just beware that the result is in character format, which is not always what we need (so add an as.numeric() if needed).
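One more option, if a tidyverse dependency is acceptable: tidyr::uncount() repeats each row of a data frame according to a weights column, so running a table through as.data.frame() first gives a reasonably battle-tested inverse.
library(tidyr)
# Repeat each row of the count data frame Freq times.
titanic_long <- uncount(as.data.frame(Titanic), Freq)
nrow(titanic_long) == sum(Titanic)
# [1] TRUE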

generate an output from a calculation between 2 columns in R

I have a data set representing movement through a 2d environment with respect to time:
time(s) start_pos fwd_dist rev_dist end_pos
1 0.0 4.0 -3.0 2.0
2 2.0 5.1 0.5 3.0
3 3.0 4.7 -0.5 3.5
4 3.5 3.6 -1.8 2.1
5 2.1 2.6 -2.1 1.0
6 1.0 1.5 -1.5 -0.2
I want to make another column that checks which of "end_pos" and "start_pos" is larger and subtracts the larger of the two from "fwd_dist". I'm trying to loop through the dataset but seem to be struggling with the syntax in R:
i <- 0
while (i < length(data[,1])) {
  if (data[i,4] > data[i,1]) {
    print(data[i,2] - data[i,4])
  } else {
    print(data[i,2] - data[i,1])
  }
  i <- i + 1
}
I keep getting the error:
Error in if (data[i, 4] > data[i, 1]) { :
argument is of length zero
pmax(start_pos,end_pos)
will give you the parallel maximum (i.e., componentwise) of two vectors. So you are probably looking for
fwd_dist-pmax(start_pos,end_pos)
A data frame based approach:
data$difference <- data$fwd_dist - pmax(data$start_pos, data$end_pos)
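For completeness, a small reproducible sketch of that data-frame approach, with the columns retyped from the question (time(s) is renamed to time so the name stays syntactic):
data <- data.frame(
  time      = 1:6,
  start_pos = c(0.0, 2.0, 3.0, 3.5, 2.1, 1.0),
  fwd_dist  = c(4.0, 5.1, 4.7, 3.6, 2.6, 1.5),
  rev_dist  = c(-3.0, 0.5, -0.5, -1.8, -2.1, -1.5),
  end_pos   = c(2.0, 3.0, 3.5, 2.1, 1.0, -0.2)
)
# Subtract the row-wise larger of start_pos and end_pos from fwd_dist.
data$difference <- data$fwd_dist - pmax(data$start_pos, data$end_pos)
data$difference
# [1] 2.0 2.1 1.2 0.1 0.5 0.5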

how to suppress output when using `:=` in R {data.table}, prior to v1.8.3?

Is there a way to prevent data.table from printing the new data.table after assigning a new column by reference? I gather the standard behaviour is
library(data.table)
example(data.table)
DT
# x y v
# 1: a 1 42
# 2: a 3 42
# 3: a 6 42
# 4: b 1 11
# 5: b 3 11
# 6: b 6 11
# 7: c 1 7
# 8: c 3 8
# 9: c 6 9
DT[,z:=1:nrow(DT)]
# x y v z
# 1: a 1 42 1
# 2: a 3 42 2
# 3: a 6 42 3
# 4: b 1 11 4
# 5: b 3 11 5
# 6: b 6 11 6
# 7: c 1 7 7
# 8: c 3 8 8
# 9: c 6 9 9
i.e. the table is printed to the screen after assignment. Is there a way to stop data.table from showing the new table after assigning the new column z? I know I can stop this behaviour by saying
DT <- copy(DT[,z:=1:nrow(DT)])
but that is defeating the purpose of := (which is designed to avoid copies).
Since <-.data.table doesn't make a copy, you can use <-:
Create a data.table object:
library(data.table)
di <- data.table(iris)
Create a new column:
di <- di[, z:=1:nrow(di)]
di
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species z
# [1,] 5.1 3.5 1.4 0.2 setosa 1
# [2,] 4.9 3.0 1.4 0.2 setosa 2
# [3,] 4.7 3.2 1.3 0.2 setosa 3
# [4,] 4.6 3.1 1.5 0.2 setosa 4
# [5,] 5.0 3.6 1.4 0.2 setosa 5
# [6,] 5.4 3.9 1.7 0.4 setosa 6
# [7,] 4.6 3.4 1.4 0.3 setosa 7
# [8,] 5.0 3.4 1.5 0.2 setosa 8
# [9,] 4.4 2.9 1.4 0.2 setosa 9
# [10,] 4.9 3.1 1.5 0.1 setosa 10
# First 10 rows of 150 printed.
It is also worth remembering that R only prints the value of an object in interactive mode.
So, in batch mode, you can simply use:
di[, z:=1:nrow(di)]
This will not produce any output when run as a script in batch mode.
Further info from Matthew Dowle:
Also see FAQ 2.21 and 2.22 :
2.21 Why does DT[i,col:=value] return the whole of DT? I expected either no visible value (consistent with <-), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference.
So that compound syntax can work; e.g., DT[i,done:=TRUE][,sum(done)]. The number of rows updated is returned when verbosity is on, either on a per query basis or globally using options(datatable.verbose=TRUE).
2.22 Ok, but can't the return value of DT[i,col:=value] be returned invisibly, then?
We tried to but R internally forces visibility on for [. The value of
FunTab's eval column (see src/main/names.c) for [ is 0 meaning force
R_Visible on (see R-Internals section 1.6). Therefore, when we tried
invisible() or setting R_Visible to 0 directly ourselves, eval in
src/main/eval.c would force it on again.
After getting used to this behaviour, you might grow to prefer it (we have). After all, how many times do we subassign using <- and then immediately look at the data to check it's ok?
We can mix := into a j which also returns data; a mixed update and select in one query. To detect whether j solely updates (and then behave differently) could be confusing.
Second update from Matthew Dowle:
We have now found a solution and v1.8.3 no longer prints the result when := is used. We will update FAQ 2.21 and 2.22.
For a very long data table name, it seems that the following is equivalent in performance and can be shorter (I prefer short names but sometimes need a longer name to remember what an object really contains):
invisible(Very.Long.Data.Table[i,col:=value])
compare with:
Very.Long.Data.Table<-Very.Long.Data.Table[i,col:=value]
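A related idiom once you are on v1.8.3 or later, where := returns its result invisibly: appending an empty [] to the call forces the updated table to print when you do want to inspect it immediately.
DT[, z := 1:nrow(DT)]      # silent from v1.8.3 onwards
DT[, z := 1:nrow(DT)][]    # the trailing [] prints the updated data.table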
