Stepwise fill dataframe - r

I'm using a for-loop to perform operations on specific subsets of my data. At the end of each iteration of the for loop, I have all the values that I need to fill a row of my dataframe.
So far I tried
df=NULL
for(...){
//stuff to calculate
newline=c(allthethingscalculated)
df=rbind(df,newline)
}
this results in the contents of the dataframe not being accessable using '$' , because the rows are then atomic vectors.
I also tried to append the values I get at the end of each iteration to an already existing vector and when the for loop ends create a dataframe from these vectors using but appending the values to the respective vector didn't work, the values weren't added.
x<-data.frame(a,b,c,d,...)
Any ideas on this?
Since my for loop iterates over IDs in my data, I realized I could do something like this:
uids=unique(data$id)
filler=c(1:length(uids))
df=data.frame(uids,filler,filler,filler,filler,filler,filler,filler,filler,filler)
for(i in uids){
...
df[i,]<-newline
}
I used filler to create a dataframe with the correct number of columns and rows so I don't get an error like 'replacement has length of 9, replacement has length of 1'
Is there a better way to do this? Using this approach I still have the values of filler in the respective row that I'd need to remove?

This should work, can your show us you data ?
R) x=data.frame(a=rep(1,3),b=rep(2,3),c=rep(3,3))
R) d=c(4,4,4)
R) rbind(x,d)
a b c
1 1 2 3
2 1 2 3
3 1 2 3
4 4 4 4
R) cbind(x,d)
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4

Related

R - counting adjacent duplicate items

New to R and would like to do the following operation:
I have a set of numbers e.g. (1,1,0,1,1,1,0,0,1) and need to count adjacent duplicates as they occur. The result I am looking for is:
2,1,3,2,1
as in 2 ones, 1 zero, 3 ones, etc.
Thanks.
We can use rle
rle(v1)$lengths
#[1] 2 1 3 2 1
data
v1 <- c(1,1,0,1,1,1,0,0,1)

Create vector by given distibution of values

Let's say I have a vector a = (1,3,4).
I want to create new vector with integer numbers in range [1,length(a)]. But the i-th number should appear a[i] times.
For the vector a I want to get:
(1,2,2,2,3,3,3,3)
Would you explain me how to implement this operation without several messy concatenations?
You can try rep
rep(seq_along(a), a)
#[1] 1 2 2 2 3 3 3 3
data
a <- c(1,3,4)

replace specific number in a vector from a list in r

I've been working on this problem and can't seem to figure out the proper solution. Ultimately I'm going to use dplyr in order to group by and apply a function to a column. I turned the column into a vector. Here is a snippet:
vec1 <- append(append(append(rep(1,3),rep(2,6)), rep(3,5)),rep(4,2))
if the number repeats more than 3 times, I want to change the following number to 1. So in the vector above, the number 2 occurs 6 times and the number 3 occurs 5 times. That means I want to replace the number 3 and 4 with 1. Ultimately in this snippet, the answer I'm looking for is:
c(1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1)
What I have below worked for cases when only one number was repeated more than 3 times, but not multiple. In addition, if I'm doing this inefficiently I'd like to learn how to better script it.
stack <- table(vec1)
stack1 <- list(as.numeric(rownames(data.frame(stack[stack>3]))) + 1)
replace(vec1,vec1 == stack1,1)
thanks in advance for any help
Try
inverse.rle(within.list(rle(vec1),
values[c(FALSE,(lengths >3)[-length(lengths)])] <- 1))
#[1] 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1

Calling on a column from a data frame within a data frame

I have a list of data frame (lets call that "data") that I have generated which goes something like this:
$"something.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
$"something else.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
I would like to output from the table "something.csv" one number within column x.
So far I have used:
data$"something.csv"$x[2]
This coding works and I am happy that it does, but my problem is that I want to automate this process and so i have put all the table titles into a list (filename) which goes:
[1] "something.csv", "something else.csv"
So i made a for loop which should allow me to do so but when I put in:
data$filename[1]$x[2]
it gives me back NULL.
When i print filename[1], I get [1] "something.csv" and if I type
data$"something.csv"$x[2]
I get the result I want. so if filename[1] = "something.csv", why does it not give me the same results?
I just want my code to out put the second row of column x and automate by using filename[i] in a for loop.
The way you have tried to approach the problem tries to find a column 'filename[1]' from the list, but it is not found. Hence, the NULL gets returned.
You need to use square brackets, and subset the object data. Here's an example:
# Generate data
data<-vector("list", 2)
names(data)<-c("something.csv", "something else.csv")
data[[1]]<-data.frame(x=1:3, y=1:3, z=1:3)
data[[2]]<-data.frame(x=1:3, y=1:3, z=1:3)
filename<-names(l)
# Subset the data
# The first data frame, notice the square brackets for subsetting lists!
data[[filename[1]]]
# column x
data[[filename[1]]]$x
# Second observation of x
data[[filename[1]]]$x[2]
The above uses for subsetting the names of the objects in the list. You can also use the number-based subsetting suggested by #Jeremy.
you can also use [ and [[ to call data$"something.csv"$x[2] try
data[[1]][2,1]
where [[1]] is the first list element and [2,1] is the data frame reference element

Applying a function on each row of a data frame in R

I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess 'return ()' return nothing?).
I would like to apply this function on each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has less rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row n the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", otherwise returns null, then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get both an answer specific to this kind of task but also to a more general case when even the result of the helper function (the one that operates on a single row) may be an arbitrary data frame with a single row. Please note than even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size>100, but a more complex condition that is checked by a function, let's say boo(single_row_df).
P.s.
What I have done so far in these cases is to use apply(df, MARGIN=1) then do.call(rbind ...) but I think it give me some trouble when my dataframe only has a single row (I get Error in do.call(rbind, filterd) : second argument must be a list)
UPDATE
Following Stephen reply I did the following:
ranges.filter <- function(ranges,boo) {
subset(x=ranges,subset=!any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the work, but I get a warning saying numerical expression has 53 elements: only the first used.
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages(plyr)
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when summing result of boolean/logical vectors, TRUE values are converted to 1s and FALSE values are converted to 0s)
test <- function(x)
rowSums(mapply(function(start,end,x) x >= start & x <= end,
start=c(100,250,698,1988),
end=c(200,400,1520,2147))) == 0
subset(dfr,test(size))
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical expression that determines which rows are kept. You can make this expression use values from as many columns as you want, eg grepl("ave",name) & size>50

Resources