I have a data.frame (df) whose number of rows (numElement) varies, and I want to extract numExtract elements spread evenly across it and store them in a new data frame (extractData). With the script below, extractData sometimes ends up with one element more than numExtract. How can I fix it?
Script:
numElement<-400
df<-data.frame(seq(1:numElement))
numExtract<-5
extractData <- df[seq(1, nrow(df), by = round(nrow(df)/numExtract)),]
numElement<-400
df<-data.frame(seq(1:numElement))
numExtract<-7
extractData <- df[seq(1, nrow(df), by = round(nrow(df)/numExtract)),]
I cannot comment yet, but note that round() without extra arguments rounds to the nearest integer.
In your first case 400/5 = 80, so you take every 80th element (indexes 1 81 161 241 321, five in total). In the second case 400/7 = 57.14 is rounded to 57, so you take every 57th element and get the indexes 1 58 115 172 229 286 343 400 (8 indexes in total).
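One way to avoid the extra element is to ask seq() for a fixed number of points instead of a fixed step. The following is a minimal sketch, not the only fix; even_rows is a hypothetical helper name, and drop = FALSE is added so the result stays a data frame.

```r
numElement <- 400
df <- data.frame(value = seq_len(numElement))

# Hypothetical helper: n row positions spread evenly from the first to the last row.
# unique() only matters when n is larger than the number of rows.
even_rows <- function(n_rows, n) unique(round(seq(1, n_rows, length.out = n)))

extract5 <- df[even_rows(nrow(df), 5), , drop = FALSE]
extract7 <- df[even_rows(nrow(df), 7), , drop = FALSE]

nrow(extract5)
nrow(extract7)
```

Because length.out fixes the number of positions, the row count no longer depends on how nrow(df)/numExtract happens to round.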
Custom function
Use cut to obtain more intuitive intervals, and extract the breaks. It uses gsubfn::strapply for substring extraction.
library(gsubfn)
myfun <- function(maxval, numbreaks) {
require(gsubfn)
x <- unique(cut(1:maxval, numbreaks-1))
A <- sapply(x, function(Z) round(as.numeric(strapply(as.character(Z), "^[(](\\S+)[,]", perl=TRUE))))
A <- c(A, maxval)
return(A)
}
Output
myfun(400, 5)
# 1 101 200 300 400
myfun(400, 7)
# 1 68 134 200 267 334 400
Sorry for my probably easy to solve question. I have a dataframe of 400 columns and 17532 rows and I want to unlist this to create another dataframe of 8 columns and 876600 rows. Basically unlist from 1 to 50, 51 to 100, 101 to 150 etc... However I'm running into some problems:
flw<-data.frame(matrix(NA, ncol=8, nrow=876600)) ## create an empty dataframe
seq(1,length(fl),by=50) ## sequence of columns which effectively is '1 51 101 151 201 251 301 351' with length(fl)=400
for (i in seq(1,length(fl),by=50) ){
flw[i] <- as.data.frame(unlist(fl[i:(i+49)]))
}
I get the error:
Error in `[<-.data.frame`(`*tmp*`, i, value = list(`unlist(fl[(i):(i + 49)])` = c(13.24512, :
new columns would leave holes after existing columns
I don't understand why as it shouldn't leave any holes. It should unlist from 1 to 50, and then 51 to 100 etc and this will be 8 columns x 876600. What am I missing?
Your problem is the indexing into flw. You are using the same index for flw that you use for fl, so you end up assigning to column 1 and then to column 51. flw has only 8 columns, so writing column 51 would leave columns 9 to 50 as holes.
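A minimal sketch of the loop with a separate counter for the target column. The small fl below is a made-up stand-in (2 rows, 6 columns, blocks of 3) so the example runs on its own; with the real data the block size would be 50.

```r
# Toy stand-in for the real data (assumption): 2 rows x 6 columns,
# stacked in blocks of 3 columns instead of 50
fl <- as.data.frame(matrix(1:12, nrow = 2))
block <- 3
starts <- seq(1, ncol(fl), by = block)

# One target column per block, so the target has length(starts) columns
flw <- data.frame(matrix(NA, nrow = nrow(fl) * block, ncol = length(starts)))

for (k in seq_along(starts)) {
  i <- starts[k]
  # k indexes the target column, i the first source column of the block
  flw[[k]] <- unlist(fl[i:(i + block - 1)], use.names = FALSE)
}

dim(flw)
```

The source index i still jumps by the block size, while the target index k runs 1, 2, 3, ... so no columns are skipped.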
If you're not tidyverse averse, this can be done without the loop using purrr and dplyr:
library(magrittr) # provides the %>% pipe
purrr::map_dfc(
seq(1,length(fl),by=50),
~dplyr::select(fl, .x:(.x+49)) %>% unlist()
)
I wrote a function in R which is supposed to return the first five developers who made the most input:
developer.busy <- function(x){
bus.dev <- sort(table(test2$devf), decreasing = TRUE)
return(bus.dev)
}
developer.busy(test2)
ericb shields mdejong cabbey lord elliott-oss jikesadmin coar
3224 1432 998 470 241 179 77 1
At the moment it just prints all developers sorted in decreasing order. I just want the first 5 to be shown. How can I do this? Any suggestion is welcome.
If we want the first five, either index with [ or use head. The function below is modified to take three inputs: the data object, the column name ('colnm') and the number of elements to extract ('n').
developer.busy <- function(data, colnm, n){
# one option is to index the sorted table with [
# sort(table(data[[colnm]]), decreasing = TRUE)[seq_len(n)]
# another option is head(), which also copes with fewer than n entries
head(sort(table(data[[colnm]]), decreasing = TRUE), n)
}
developer.busy(test2, "devf", n = 5)
Using a reproducible example with the mtcars dataset:
data(mtcars)
developer.busy(mtcars, 'carb', 5)
# 2 4 1 3 6
#10 10 7 3 1
I have some R code and I want to sum a result derived from a data frame (df$number is the unlisted result in 'res').
The values to be summed are: [1] 1 3 5 7 9 20 31 42
digits <- function(x){as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))}
generated <- function(x){ x + sum(digits(x))}
digitadition <- function(x,N) { c(x, replicate(N-1, x <<- generated(x))) }
res <- NULL
for(i in 0:50){
for(j in 2:50){
tmp <- digitadition(i,j)
IND <- 50*(i-1) + (j-1) - (i-1) #to index results
res[IND] <- tmp[length(tmp)]
}
}
df <- data.frame(number = unlist(res), generator=rep(1:50, each=49), N=2:50)
total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
setdiff(1:50, as.numeric(names(total)))
sum(total)
I'm using sum(total), but the result is 155, which is not right; the correct answer is 118.
What is the specific code to sum these values?
Thank you.
I ran your code and I think you may be confused about what you want to sum.
Your setdiff() call returns the values 1 3 5 7 9 20 31 42, whose sum is 118.
So, if you do sum(setdiff(1:50, as.numeric(names(total)))), you'll get the 118 you are looking for.
Your total variable is different from this. Let me explain what you are doing and what I think you should do.
Your code: total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
table() gives you each unique value in the vector together with the number of times it appears, so sum(total) adds up those counts rather than the values.
names() of this table returns the unique values as character strings, which is why you need as.numeric().
But unique() does this job for you: it extracts the unique values from a vector.
Here's what you can do: total <- unique(df$number[which(df$number <= 50)])
Here which() returns the positions of the values <= 50, and unique() extracts the unique values at those positions.
And finally: sum(setdiff(1:50, total)) sums all the values from 1 to 50 that are not in your total vector.
In my opinion this is more intuitive than going through table() and names(). Mind the argument order: setdiff(total, 1:50) would instead return the values of total that are not in 1:50.
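For comparison, here is a compact sketch of the same computation without the nested loops. It assumes the goal is to find the numbers from 1 to 50 that cannot be written as n plus the digit sum of n, which is what the expected values 1 3 5 7 9 20 31 42 suggest; digit_sum, generated and missing are hypothetical names.

```r
# Digit sum of a single non-negative integer
digit_sum <- function(x) sum(as.integer(strsplit(as.character(x), "")[[1]]))

# Every value n + digit_sum(n) that a generator n in 0..50 can produce
generated <- vapply(0:50, function(n) n + digit_sum(n), numeric(1))

# Numbers in 1..50 that no generator produces, and their sum
missing <- setdiff(1:50, generated)
missing
# 1 3 5 7 9 20 31 42
sum(missing)
# 118
```

Summing the output of setdiff() adds the values themselves, whereas summing a table() adds how often each value occurred.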
I have data in columns which I need to do calculations on. Is it possible to do this using previous row values without using a loop? E.g. when the ID in the first column is 139, calculate the mean of the last 5 values (current row included) and the percent change from the first of those 5 rows to the current row?
ID Data PF
135 5 123
136 4 141
137 5 124
138 6 200
139 1 310
140 2 141
141 4 141
So here in this dataset you would do:
Find 139 in ID column
Return average of last 5 rows in Data (Gives 4.2)
Return performance of values in PF 5 rows above to current value (Gives 152%)
If I were to use a loop, it would look like this:
data$New_column <- NA
for (i in 1:nrow(data)){
if (data$ID[i] == 139 && i >= 5) {
data$New_column[i] <- data[i,"PF"] / data[i-4,"PF"] - 1
}
}
The problem is that the loop takes too long because there are too many data points. The ID 139 will appear several times in the dataset.
Many thanks.
Carlos
As pointed out by Tutuchacn and Sotos, use the package zoo to get the mean of the Data in the last N rows (inclusive of the row) you are querying (assuming your data is in the data frame df):
library(zoo)
ind <- which(df$ID==139) ## this is the row you are querying
N <- 5 ## here, N is 5
res <- rollapply(df$Data, width=N, mean)[ind-(N-1)]
print(res)
## [1] 4.2
rollapply(..., mean) returns the rolling mean of the windowed data of width=N. Note that the index used to query the output from rollapply is lagged by N-1 because the rolling mean is applied forward in the series.
To get the percent performance from PF as you specified:
percent.performance <- function(x) {
z <- zoo(x) ## create a zoo series
lz <- lag(z, -4, na.pad = TRUE) ## value 4 rows earlier (a negative k looks back); na.pad keeps the length so [ind] still lines up
return(z/lz - 1)
}
res <- as.numeric(percent.performance(df$PF)[ind])
print(res)
## [1] 1.520325
Here, we define a function percent.performance that returns what you want for all rows of df for which the computation makes sense. We then extract the row we want using ind and convert it to a number.
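If an extra package is not wanted, both quantities can also be sketched in base R. This is an alternative under stated assumptions, not the zoo method above: the data frame is rebuilt from the question's table, and roll_mean, pf_before and perf are hypothetical names.

```r
df <- data.frame(ID   = 135:141,
                 Data = c(5, 4, 5, 6, 1, 2, 4),
                 PF   = c(123, 141, 124, 200, 310, 141, 141))
N <- 5

# Trailing mean of the current row and the N-1 rows before it (NA until N rows exist)
df$roll_mean <- as.numeric(stats::filter(df$Data, rep(1 / N, N), sides = 1))

# Relative change of PF against the value N-1 rows earlier
pf_before <- c(rep(NA, N - 1), head(df$PF, -(N - 1)))
df$perf <- df$PF / pf_before - 1

# Works for every occurrence of the ID at once
df[df$ID == 139, c("ID", "roll_mean", "perf")]
```

Because both columns are computed for all rows in one vectorised pass, selecting the rows with ID 139 afterwards handles repeated occurrences without a loop.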
Hope this helps.
Is this what you want?
ntest=139
sol<-sapply(5:nrow(df),function(ii){
tdf<-df[(ii-4):ii,]
if(tdf[5,1]==ntest)
c(row=ii,average=mean(tdf[,"Data"]),performance=round(100*(tdf[5,"PF"]/tdf[1,"PF"]-1),0))
})
sol<- sol[ ! sapply(sol, is.null) ] #remove NULLs
sol
[[1]]
row average performance
5.0 4.2 152.0
This could be a decent start:
mytext = "ID,Data,PF
135,5,123
136,4,141
137,5,124
138,6,200
139,1,310
140,2,141
141,4,141"
mydf <- read.table(text=mytext, header = TRUE, sep = ",")
do.call(rbind,lapply(mydf$ID[which(mydf$ID==139):nrow(mydf)], function(x) {
tempdf <- mydf[1:which(mydf$ID==x),]
data.frame(ID=x,Data=mean(tempdf$Data),PF=100*(tempdf[nrow(tempdf),"PF"]-tempdf[(nrow(tempdf)-4),"PF"])/tempdf[(nrow(tempdf)-4),"PF"])
}))
ID Data PF
139 4.200000 152.03252
140 3.833333 0.00000
141 3.857143 13.70968
The idea here is: you take the IDs from 139 to the end and apply lapply to each of them, generating a temporary data.frame that includes all the rows above that particular ID (including the ID itself). Then you take the mean of the Data column and the rate of change (i.e. what you call performance) of the PF column. Note that the mean runs over all rows up to that ID, so it equals the "last 5 rows" mean only for ID 139; use mean(tail(tempdf$Data, 5)) to restrict it to the last 5 rows.
I know this should be simple but I just can't do it...I have a data frame called data that works nicely and does what I want it to with the correct column headers and everything. I can call colSums() to get a list of 21 numbers which are the sums of each column.
> a <- colSums(data,na.rm = TRUE)
> names(a) <- NULL
> a
[1] 1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80
[14] 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
The problem is I need a list with the first number alone, the sum of the next two, sum of the next 3, sum of the next 4 etc. until I run out of numbers. I imagine it would look something like this:
c(sum(a[1]),sum(a[2:3]),sum(a[4:6])... etc.
Any help or a different way to do this would be greatly appreciated!
Thank you.
You should only need to go out to something on the order of sqrt(length(vector)). The seq function lets you specify a start integer and a length, so passing each integer x to seq(1+x*(x-1)/2, length=x) should create the right set of index sequences. It wasn't clear whether an incomplete sequence at the end should return a result or NA, so I put in na.rm=TRUE. You might decide otherwise. (You did not illustrate a dataframe but rather an ordinary numeric vector.)
sumsegs <- function(vec) sapply(1:sqrt(2*length(vec)), function(x)
sum( vec[seq(1+x*(x-1)/2, length=x)], na.rm=TRUE) )
a <- scan()
1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
# 22: enter carriage return to stop scan input
#Read 21 items
sumsegs(a)
#[1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3
I'm not exactly sure what the right upper limit is for the number to send to the inner function. sqrt(length(vec)) is too short, but sqrt(2*length(vec)) seems to be "working", at lower numbers anyway.
> sapply( sapply(1:sqrt(2*100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105
> sapply( sapply(1:sqrt(100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55
This returns the last element of the sequences so formed. Making the factor 2.1 rather than 2 corrects minor deficiencies for lengths in the range 500-1000. With 2.1 the sequences reach past 500:
tail(lapply( sapply(1:sqrt(2.1*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 528
With 2 they stop short at 496, so elements 497 to 500 would be dropped:
tail(lapply( sapply(1:sqrt(2*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 496
Going higher did not seem to degrade the "times 2" correction. There's probably some kewl number theory explanation for this.
tail(lapply( sapply(1:sqrt(2*100000), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 100128
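One possible explanation for the "times 2" rule: x groups of sizes 1 to x cover x(x+1)/2 elements, which is roughly x^2/2, so x is roughly sqrt(2n). Solving x(x+1)/2 >= n exactly gives x = ceiling((sqrt(8n+1)-1)/2). A minimal sketch of that bound follows; n_groups and sumsegs2 are hypothetical names, and it assumes a trailing incomplete group should still be summed.

```r
# Smallest number of groups of sizes 1, 2, 3, ... that covers n elements
n_groups <- function(n) ceiling((sqrt(8 * n + 1) - 1) / 2)

sumsegs2 <- function(vec) {
  k <- n_groups(length(vec))
  # Group labels 1, 2, 2, 3, 3, 3, ... truncated to the length of vec
  groups <- rep(seq_len(k), seq_len(k))[seq_along(vec)]
  as.vector(tapply(vec, groups, sum, na.rm = TRUE))
}

n_groups(c(21, 100, 500, 100000))
# 6 14 32 447
sumsegs2(1:10)
# 1 5 15 34
sumsegs2(1:12)
# 1 5 15 34 23
```

The group counts 14, 32 and 447 correspond to the last indexes 105, 528 and 100128 shown above, so the exact bound never stops short of the end of the vector.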
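Alternatively a much more naive method is:
sums <- colSums(data, na.rm = TRUE)
n <- 0 # size of the current group minus one
i <- 1 # index where the current group starts
newVec <- vector("numeric")
while (i <= length(sums)) {
intermediate <- 0
for (j in i:(i + n)) {
# skip indexes past the end so an incomplete last group is still summed
if (j <= length(sums))
intermediate <- intermediate + sums[[j]]
}
newVec <- c(newVec, intermediate)
i <- i + n + 1 # the next group starts right after this one
n <- n + 1
}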
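Here's a similar approach, using rep(...) and by(...). It assumes length(a) is a triangular number (1, 3, 6, 10, 15, 21, ...), so that n comes out as a whole number: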
n <- (-1+sqrt(1+8*length(a)))/2 # number of groups
groups <- rep(1:n,1:n) # indexing vector
result <- as.vector(by(a,groups,sum))
result
# [1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3