Storing data frame output from a while loop in R

Let's say I have a function called remove_fun which reduces the number of rows of a dataframe based on some conditions (this function is too verbose to include in this question). This function takes as its input a dataframe with 2 columns. For example, an input df called block_2_df could look like this:
block_2_df
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
For this example, let's say the function remove_fun removes 1 row at a time based on the highest value of seq in block_2_df$seq. Applying remove_fun once would result in a new dataframe that looks like this:
remove_fun(block_2_df)
Treatment seq
1 29
1 23
1 6
2 41
1 5
2 44
I.e., the row containing seq==60 in block_2_df was removed by remove_fun.
I can create a while loop which repeats this operation on block_2_df via remove_fun based on the number of rows remaining in block_2_df as:
while (dim(block_2_df)[1] > 1) {
  block_2_df <- remove_fun(block_2_df)
  print(remove_fun(block_2_df))
}
This while loop reduces block_2_df until it has 1 row left (the lowest value of block_2_df$seq), and prints out the 'updated' versions of block_2_df until it is reduced to one row.
However, I'd like to save each 'updated' version of block_2_df (i.e. block_2_df with 7, then 6, then 5,....,then 1 row) produced by the while loop. How can I accomplish this? I know that with for loops this could be done by creating an empty list and storing each 'updated' block_2_df as the ith element of that list, but I'm not sure how to do something similar in a while loop. It would be great to get a list of dfs as the output of this while loop.

Just create and maintain an index counter yourself. It's a bit more trouble than a for() loop, which does it on its own, but it's not so difficult.
saved <- list()
i <- 1
while (nrow(block_2_df) > 1) {
  block_2_df <- remove_fun(block_2_df)
  saved[[i]] <- block_2_df
  i <- i + 1
  print(block_2_df)
}
Also, you were calling remove_fun twice in your loop; that was probably not what you wanted to do. I've corrected that; if I'm wrong please say so.
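For reference, here is a runnable end-to-end sketch of the whole pattern. The remove_fun below is a hypothetical stand-in (the real one wasn't posted) that simply drops the row with the largest seq value:

```r
# Stand-in for the asker's remove_fun: drop the row with the largest seq
remove_fun <- function(df) df[-which.max(df$seq), , drop = FALSE]

block_2_df <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                         seq       = c(29, 23, 60, 6, 41, 5, 44))

saved <- list()
i <- 1
while (nrow(block_2_df) > 1) {
  block_2_df <- remove_fun(block_2_df)
  saved[[i]] <- block_2_df   # keep a copy of each shrinking data frame
  i <- i + 1
}

length(saved)      # 6 snapshots, with 6, 5, 4, 3, 2, 1 rows
nrow(saved[[6]])   # 1 (only the lowest seq value, 5, remains)
```

An alternative to the manual counter is `saved[[length(saved) + 1]] <- block_2_df`, which appends without tracking i at all.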

Related

How to apply a function in different ranges of a vector in R?

I have the following matrix:
x = matrix(c(1, 2, 2, 1, 10, 10, 20, 21, 30, 31, 40,
             1, 3, 2, 3, 10, 11, 20, 20, 32, 31, 40,
             0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0), 11, 3)
I would like to find for each unique value of the first column in x, the maximum value (across all records having that value of the first column in x) of the third column in x.
I have created the following code:
v1 <- sequence(rle(x[,1])$lengths)
A <- split(seq_along(v1), cumsum(v1 == 1))
A_diff <- rep(0, length(A))
for (i in 1:length(A)) {
  A_diff[i] <- max(x[A[[i]], 3])
}
However, the provided code only works when equal elements are consecutive in the first column (because I use rle), and it relies on a for loop.
So, how can I make it work in general, and without the for loop, that is, using a function?
If I understand correctly:
> tapply(x[,3],x[,1],max)
1 2 10 20 21 30 31 40
1 1 1 0 1 1 0 0
For grouping by more than one variable I would use aggregate; note that matrices are cumbersome for this purpose, so I would suggest you transform it to a data frame. Nonetheless:
> aggregate(x[,3],list(x[,1],x[,2]),max)
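For completeness, a self-contained sketch combining the matrix from the question with both suggestions (the aggregate call groups by columns 1 and 2):

```r
x <- matrix(c(1, 2, 2, 1, 10, 10, 20, 21, 30, 31, 40,
              1, 3, 2, 3, 10, 11, 20, 20, 32, 31, 40,
              0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0), 11, 3)

# Max of column 3 within each unique value of column 1;
# tapply groups by value, so equal values need not be consecutive
res <- tapply(x[, 3], x[, 1], max)
res
#  1  2 10 20 21 30 31 40
#  1  1  1  0  1  1  0  0

# Grouping by two variables; the result is a data frame
aggregate(x[, 3], list(x[, 1], x[, 2]), max)
```

Because tapply sorts groups by their (character-converted) names, the result is indexed by the unique values of column 1, e.g. `res[["20"]]` is 0.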

Minimising number of computations in fuzzy matching and a for loop

I am currently trying to find some potential duplicates in a large data set (500,000+ lines) using fuzzy matching. There are three main parts to this code:
A function that I have written that identifies the most likely potential duplicate in a data set (by returning a score; it selects the highest score).
A function that identifies the position of the record that is the most likely to be a duplicate.
A for loop that runs both of the functions above for every record and returns values in the DupScore column and the positionBestMatch column.
An example of a resulting dataset is below:
Name DOB DupScore positionBestMatch
Ben 6/3/1994 15 3
Abe 5/5/2005 11 5
Benjamin 6/3/1994 15 1
Gabby 01/01/1900 10 6
Abraham 5/5/2005 11 2
Gabriella 01/01/1900 10 4
The for loop to calculate these scores looks a bit like this (scorefunc and positionfunc are self-written functions):
for (i in seq_along(df$Name)) {
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
}
Obviously, on a data set with so many rows, this loop is time-consuming and computationally intensive, as it loops through each row. How can I edit my for loop so that:
When a DupScore is calculated for a row, it inserts the score not only in row [i] but also in the row given by positionBestMatch?
And the loop only runs for rows with empty DupScore and positionBestMatch values.
I hope this makes sense!
Try using a while loop:
all_inds <- seq_len(nrow(df))
while (length(all_inds) > 1) {
  i <- all_inds[1]
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
  # fill in the best match's score at the same time
  df$dupScore[df$positionBestMatch[i]] <- df$dupScore[i]
  all_inds <- setdiff(all_inds, c(i, df$positionBestMatch[i]))
}
But this will keep some empty values for df$positionBestMatch.
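To make the pattern concrete, here is a runnable toy version. The scorefunc and positionfunc below are hypothetical stand-ins (simple lookup tables pairing each row with its partner) for the asker's self-written functions:

```r
# Hypothetical stand-ins: row i's best match is a fixed partner row,
# and the pair shares one score
partner <- c(3, 4, 1, 2)
scores  <- c(15, 11, 15, 11)
scorefunc    <- function(i) scores[i]
positionfunc <- function(i) partner[i]

df <- data.frame(Name = c("Ben", "Abe", "Benjamin", "Abraham"),
                 dupScore = NA, positionBestMatch = NA)

all_inds <- seq_len(nrow(df))
while (length(all_inds) > 1) {
  i <- all_inds[1]
  df$dupScore[i]          <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
  # copy the score to the matching row, halving the number of calls
  df$dupScore[df$positionBestMatch[i]] <- df$dupScore[i]
  # drop both rows of the pair from the work list
  all_inds <- setdiff(all_inds, c(i, df$positionBestMatch[i]))
}

df$dupScore
# 15 11 15 11
```

Note that positionBestMatch stays NA for rows filled in "for free" (here rows 3 and 4), which is the caveat mentioned above.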

For Loop to fill a Column in R

I have a data frame with zero columns and zero rows, and I want a for loop to fill it with the numbers 1 to 39, each repeated twice. So, for instance, the result I am looking for is a single column in which each number appears twice.
Assume st is the data frame I have set already. This is what I have so far:
for (i in 1:39) {
  append(st, i)
  for (i in 1:39) {
    append(st, i)
  }
}
Expected outcome will be in a column structure:
1
1
2
2
3
3
.
.
.
.
39
39
You don't need a for loop for this. Use rep() instead:
# How many times you want each number to repeat sequentially
times_repeat <- 2
# Assign the repeated values as a data frame
test_data <- as.data.frame(rep(1:39, each = times_repeat))
# Change the column name if you want to
names(test_data) <- "Dont_encourage_the_use_of_blanks_in_column_names"

Calling on a column from a data frame within a data frame

I have a list of data frames (let's call it "data") that I have generated, which goes something like this:
$"something.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
$"something else.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
I would like to output one number from column x of the table "something.csv".
So far I have used:
data$"something.csv"$x[2]
This code works and I am happy that it does, but my problem is that I want to automate this process, so I have put all the table titles into a list (filename), which goes:
[1] "something.csv", "something else.csv"
So I made a for loop which should allow me to do so, but when I put in:
data$filename[1]$x[2]
it gives me back NULL.
When I print filename[1], I get [1] "something.csv", and if I type
data$"something.csv"$x[2]
I get the result I want. So if filename[1] equals "something.csv", why does it not give me the same result?
I just want my code to output the second row of column x, automated by using filename[i] in a for loop.
The way you have tried to approach the problem fails because $ does not evaluate its argument: data$filename[1] looks for an element literally named 'filename' in the list, which is not found. Hence the NULL.
You need to use square brackets, and subset the object data. Here's an example:
# Generate data
data <- vector("list", 2)
names(data) <- c("something.csv", "something else.csv")
data[[1]] <- data.frame(x = 1:3, y = 1:3, z = 1:3)
data[[2]] <- data.frame(x = 1:3, y = 1:3, z = 1:3)
filename <- names(data)
# Subset the data
# The first data frame; notice the double square brackets for subsetting lists!
data[[filename[1]]]
# Column x
data[[filename[1]]]$x
# Second observation of x
data[[filename[1]]]$x[2]
The above subsets the list by the names of its objects. You can also use the number-based subsetting suggested by @Jeremy.
You can also use [ and [[ instead of data$"something.csv"$x[2]; try
data[[1]][2, 1]
where [[1]] selects the first list element and [2, 1] selects row 2, column 1 of that data frame.

Applying a function on each row of a data frame in R

I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess return() returns nothing?).
I would like to apply this function to each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has fewer rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row of the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", and otherwise returns NULL, then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get both an answer specific to this kind of task and one for the more general case, where even the result of the helper function (the one that operates on a single row) may be an arbitrary single-row data frame. Please note that even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size>100, but a more complex condition checked by a function, say boo(single_row_df)).
P.S. What I have done so far in these cases is to use apply(df, MARGIN=1) and then do.call(rbind, ...), but I think it gives me some trouble when my dataframe has only a single row (I get Error in do.call(rbind, filterd) : second argument must be a list).
UPDATE
Following Stephen's reply I did the following:
ranges.filter <- function(ranges, boo) {
  subset(x = ranges, subset = !any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the job, but I get a warning saying numerical expression has 53 elements: only the first used.
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages("plyr")
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when summing a logical vector, TRUE counts as 1 and FALSE as 0):
test <- function(x)
  rowSums(mapply(function(start, end) x >= start & x <= end,
                 start = c(100, 250, 698, 1988),
                 end   = c(200, 400, 1520, 2147))) == 0
subset(dfr, test(size))
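As a runnable sketch of this approach (dfr and its size values are made up for illustration), the inner function closes over x, so mapply only iterates over the start/end pairs:

```r
# Keep values of `size` that fall outside every (start, end) range
test <- function(x)
  rowSums(mapply(function(start, end) x >= start & x <= end,
                 start = c(100, 250, 698, 1988),
                 end   = c(200, 400, 1520, 2147))) == 0

# Hypothetical example data
dfr <- data.frame(size = c(50, 150, 300, 500, 1000, 2000, 3000))

kept <- subset(dfr, test(size))
kept$size
# 50 500 3000
```

mapply returns one logical column per range; a rowSums of 0 means the value hit no range at all, so it survives the filter.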
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical vector that determines which rows are kept. The expression can use values from as many columns as you want, e.g. grepl("ave", name) & size > 50.