I have a situation like this: first of all I have a data.frame:
DF
COL1 COL2
29 1623
27 1600
30 1617
8 1620
Then, I have a vector like this:
[1] [2]
50 1602
What I need is to bind the first row of DF with the vector to have:
output
[1] [2]
29 1623
50 1602
On this output I would like to apply the prop.test using this code:
prop.test(output[,1], output[,2], correct=FALSE)
I need to do this on the entire DF, so:
first: bind first row of DF with the vector
second: prop.test
then again
first: bind second row of DF with the vector
second: prop.test
This iteratively.
Any suggestion please?
thanks a lot
# x = successes, n = trials, matching prop.test(output[,1], output[,2]) in the question
apply(DF, 1, function(x) prop.test(c(x[1], 50), c(x[2], 1602), correct = FALSE))
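A minimal sketch of the row-by-row procedure described in the question, assuming COL1 holds the counts of successes, COL2 the numbers of trials, and the fixed vector is c(50, 1602). The names ref, tests and p_values are made up here for illustration.

```r
# Example data from the question
DF <- data.frame(COL1 = c(29, 27, 30, 8),
                 COL2 = c(1623, 1600, 1617, 1620))
ref <- c(50, 1602)  # hypothetical name for the fixed vector

# One prop.test per row: x = successes, n = trials
tests <- lapply(seq_len(nrow(DF)), function(i) {
  prop.test(x = c(DF$COL1[i], ref[1]),
            n = c(DF$COL2[i], ref[2]),
            correct = FALSE)
})

# Collect the p-values into a plain numeric vector, one per row of DF
p_values <- vapply(tests, function(res) res$p.value, numeric(1))
```

Keeping the full test objects in a list means the estimates or confidence intervals can be pulled out later in the same way as the p-values.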
Suppose we have a data frame containing numeric values which looks like:
Temperature Height
32 157
31 159
33 139
I want to replace Height values with pic_00001, pic_00002 etc. so that the end result is:
Temperature Height
32 pic_00001
31 pic_00002
33 pic_00003
There are 10,000+ rows in the full data frame, hence I need a quicker way than doing this manually.
You can use sprintf:
# create the example used by the OP
dat <- data.frame(Temperature = c(32, 31, 33),
Height = c(157, 159, 139))
# use sprintf along with seq_len
dat$Height <- sprintf("pic_%05d", seq_len(NROW(dat)))
# show the result
dat
#R> Temperature Height
#R> 1 32 pic_00001
#R> 2 31 pic_00002
#R> 3 33 pic_00003
You can change the 05d if you want more leading zeros, e.g. 07d will give a seven-digit sequence. The manual page for sprintf has further details.
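As a small sketch of choosing the padding width from the data rather than hard-coding it, assuming at least five digits are wanted; the names width and fmt are made up for illustration.

```r
dat <- data.frame(Temperature = c(32, 31, 33),
                  Height = c(157, 159, 139))

# Pad to at least 5 digits, or more if the row count needs it
width <- max(5, nchar(as.character(nrow(dat))))
fmt <- paste0("pic_%0", width, "d")   # builds the format string, here "pic_%05d"
dat$Height <- sprintf(fmt, seq_len(nrow(dat)))
```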
You can do:
id <- seq_len(nrow(data))
new_values <- paste("pic_",id,sep = "")
data$Height <- new_values
To get the zero-padded final output (building on the answer by monjeanjean; I can't comment yet):
id <- seq_len(nrow(data))
new_values <- paste("pic_",formatC(id,width=5,flag="0",format="d"),sep = "")
data$Height <- new_values
You can also use the following solution:
library(dplyr)
library(stringr)
df %>%
mutate(across(Height, ~ str_c("pic_", str_pad(row_number(), 5, "left", "0"))))
Temperature Height
1 32 pic_00001
2 31 pic_00002
3 33 pic_00003
How can I divide my data frame, which has 250 columns, into 5 subsets of 50 columns each and assign them to 5 different variables?
I have tried this:
df2 <- split(df, sample(1:5, ncol(df), replace=T))
But this splits by rows, not by columns.
I want something like this:
ncol(df2_1) = 50
ncol(df2_2) = 50
ncol(df2_3) = 50
ncol(df2_4) = 50
ncol(df2_5) = 50
Each subset should contain its own distinct columns.
Following the comments by @markus to use split.default, we can modify the initial code and change the sampling so that we get exactly 50 columns in each subset.
Making some dummy data,
df <- data.frame(matrix(1:250, ncol = 250))
Then splitting (we split this way for a reason pointed out by @markus; this is the safer, more robust version):
df2 <- lapply(split.data.frame(t(df), sample(rep(1:5, ncol(df)/5))), t)
A less robust but simpler option is:
df2 <- split.default(df, sample(rep(1:5, ncol(df)/5)))
gives us,
> ncol(df2$`1`)
[1] 50
> ncol(df2$`2`)
[1] 50
> ncol(df2$`3`)
[1] 50
> ncol(df2$`4`)
[1] 50
> ncol(df2$`5`)
[1] 50
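The question also asks for five separate variables. One possible sketch, assuming the names df2_1 to df2_5 from the question and using list2env to copy the list elements into the current environment (set.seed is only there to make the random assignment repeatable):

```r
set.seed(1)  # only to make the random assignment repeatable
df <- data.frame(matrix(1:250, ncol = 250))

# Exactly 50 columns per group, assigned at random
groups <- sample(rep(1:5, ncol(df) / 5))
df2 <- split.default(df, groups)

# Rename the list elements and copy them into the current environment
names(df2) <- paste0("df2_", names(df2))
list2env(df2, envir = environment())

ncol(df2_1)
```

Keeping the subsets in the list df2 is often easier to work with than five loose variables, since the list can be passed to lapply directly.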
I have data in columns that I need to do calculations on. Is it possible to do this using previous row values without using a loop? E.g. if the value in the first column is 139, calculate the mean of the last 5 values and the percent change between the value 5 rows above and the value in the current row?
ID Data PF
135 5 123
136 4 141
137 5 124
138 6 200
139 1 310
140 2 141
141 4 141
So here in this dataset you would do:
Find 139 in ID column
Return average of last 5 rows in Data (Gives 4.2)
Return performance of values in PF 5 rows above to current value (Gives 152%)
If I did this with a loop, it would look like this:
data$New_column <- NA
for (i in 1:nrow(data)) {
  if (data$ID[i] == 139 && i >= 5) {
    data$New_column[i] <- data[i, "PF"] / data[i - 4, "PF"] - 1
  }
}
The problem is that the loop takes too long because there are too many data points. The ID 139 appears several times in the dataset.
Many thanks.
Carlos
As pointed out by Tutuchacn and Sotos, use the package zoo to get the mean of the Data in the last N rows (inclusive of the row) you are querying (assuming your data is in the data frame df):
library(zoo)
ind <- which(df$ID==139) ## this is the row you are querying
N <- 5 ## here, N is 5
res <- rollapply(df$Data, width=N, mean)[ind-(N-1)]
print(res)
## [1] 4.2
rollapply(..., mean) returns the rolling mean of the windowed data of width=N. Note that the index used to query the output from rollapply is lagged by N-1 because the rolling mean is applied forward in the series.
To get the percent performance from PF as you specified:
percent.performance <- function(x) {
z <- zoo(x) ## create a zoo series
lz <- lag(z, -4, na.pad = TRUE) ## value 4 rows earlier (in zoo a negative k looks back)
return(z/lz - 1)
}
res <- as.numeric(percent.performance(df$PF)[ind])
print(res)
## [1] 1.520325
Here, we define a function percent.performance that returns what you want for all rows of df for which the computation makes sense. We then extract the row we want using ind and convert it to a number.
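Since ID 139 may appear several times, here is a minimal base-R sketch that computes both quantities for every row first and then picks out all matching rows. It swaps in stats::filter for the rolling mean instead of zoo; the names roll_mean, perf, hits and result are made up for illustration.

```r
# Example data from the question
df <- data.frame(ID   = 135:141,
                 Data = c(5, 4, 5, 6, 1, 2, 4),
                 PF   = c(123, 141, 124, 200, 310, 141, 141))
N <- 5

# Trailing mean of the current row and the N - 1 rows before it
roll_mean <- as.numeric(stats::filter(df$Data, rep(1 / N, N), sides = 1))

# PF relative to the value N - 1 rows earlier, NA where no such row exists
n <- nrow(df)
perf <- c(rep(NA, N - 1), df$PF[N:n] / df$PF[1:(n - N + 1)] - 1)

hits <- which(df$ID == 139)   # may contain several rows
result <- data.frame(row = hits, mean = roll_mean[hits], perf = perf[hits])
```

Both vectors are computed once for the whole column, so looking up many occurrences of the ID is just an indexing step.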
Hope this helps.
Is that what you want?
ntest <- 139
sol <- sapply(5:nrow(df), function(ii) {
  tdf <- df[(ii-4):ii, ]   # the current row and the 4 rows above it
  if (tdf[5, 1] == ntest)
    c(row = ii, average = mean(tdf[, "Data"]), performance = round(100 * (tdf[5, "PF"] / tdf[1, "PF"] - 1), 0))
})
sol <- sol[!sapply(sol, is.null)]  # remove NULLs
sol
[[1]]
row average performance
5.0 4.2 152.0
This could be a decent start:
mytext = "ID,Data,PF
135,5,123
136,4,141
137,5,124
138,6,200
139,1,310
140,2,141
141,4,141"
mydf <- read.table(text=mytext, header = T, sep = ",")
do.call(rbind,lapply(mydf$ID[which(mydf$ID==139):nrow(mydf)], function(x) {
tempdf <- mydf[1:which(mydf$ID==x),]
data.frame(ID=x,Data=mean(tempdf$Data),PF=100*(tempdf[nrow(tempdf),"PF"]-tempdf[(nrow(tempdf)-4),"PF"])/tempdf[(nrow(tempdf)-4),"PF"])
}))
ID Data PF
139 4.200000 152.03252
140 3.833333 0.00000
141 3.857143 13.70968
The idea here is: you take the IDs from 139 to the end and use lapply on each of them, generating a temporary data.frame that includes all the rows above that particular ID (including the row of the ID itself). Then you take the mean of the Data column over those rows and the rate of change (i.e. what you call performance) of the PF column relative to the value 4 rows earlier.
I have the following string: "1,34:36,52:58,22:28,82:88,101:102,104:153,120:254,315:368,489:nrow(df)". Is there some way of using this string to extract the rows of a data frame (df) that correspond to the numbers in the string?
I've tried combinations of eval and get, but these don't work, and I doubt they are the correct route.
Example dataframe:
df <- as.data.frame( matrix(rnorm(5000), nrow=500,ncol=10) )
You could use a combination of eval and parse:
df <- as.data.frame( matrix(rnorm(5000), nrow=500,ncol=10) )
a <- "1,34:36,52:58,22:28,82:88,101:102,104:153,120:254,315:368,489:nrow(df)"
index <- unlist(lapply(strsplit(a, ",")[[1]], function(x)eval(parse(text=x))))
index
# [1] 1 34 35 36 52 53 54 ...
#[253] .... 494 495 496 497 498 499 500
Alternative solution since you "know" the name of the dataframe ('df' is already used in the string you have provided)
df=data.frame(matrix(rnorm(5000), nrow=500,ncol=10))
select_string="1,34:36,52:58,22:28,82:88,101:102,104:153,120:254,315:368,489:nrow(df)"
select_string_total=paste("df[c(",select_string,"),,drop=FALSE]",sep="")
eval(parse(text=select_string_total))
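If the string might come from an untrusted source, evaluating it as code can be risky. A hedged alternative sketch that expands the ranges without eval, assuming the only non-numeric token is the literal "nrow(df)"; expand_ranges is a hypothetical helper, not an existing function.

```r
df <- as.data.frame(matrix(rnorm(5000), nrow = 500, ncol = 10))
a <- "1,34:36,52:58,22:28,82:88,101:102,104:153,120:254,315:368,489:nrow(df)"

# Hypothetical helper: expand "from:to" pieces without calling eval()
expand_ranges <- function(spec, n) {
  # substitute the row count for the literal text "nrow(df)"
  spec <- gsub("nrow(df)", as.character(n), spec, fixed = TRUE)
  pieces <- strsplit(spec, ",", fixed = TRUE)[[1]]
  unlist(lapply(strsplit(pieces, ":", fixed = TRUE), function(p) {
    p <- as.integer(p)
    if (length(p) == 1) p else seq(p[1], p[2])
  }))
}

index <- expand_ranges(a, nrow(df))
subset_df <- df[index, , drop = FALSE]
```

Overlapping ranges such as 104:153 and 120:254 are kept as they are, so some rows are selected twice, just as with the eval-based versions.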
The program that I am running creates three data frames using the following code:
datuniqueNDC <- data.frame(lapply(datlist, function(x) length(unique(x$NDC))))
datuniquePID <- data.frame(lapply(datlist, function(x) length(unique(x$PAYERID))))
datlengthNDC <- data.frame(lapply(datlist, function(x) length(x$NDC)))
They have outputs that look like this:
X182L X178L X76L
1 182 178 76
X34L X31L X7L
1 34 31 7
X10674L X10021L X653L
1 10674 10021 653
What I am trying to do is combine the rows together into one data frame with the desired outcome being:
X Y Z
1 182 178 76
2 34 31 7
3 10674 10021 653
but the rbind command doesn't work due to the names of all the columns being different. I can get it to work by using the colnames command after creating each variable above, but it seems like there should be a more efficient way to accomplish this by using one of the apply commands or something similar. Thanks for the help.
One way, since everything seems to be numeric, would be this:
mylist <- list(dat1,dat2,dat3)
# assuming your three data.frames are dat1:dat3 respectively
do.call("rbind",lapply(mylist, as.matrix))
# X182L X178L X76L
#[1,] 182 178 76
#[2,] 34 31 7
#[3,] 10674 10021 653
Basically, this works because rbind on matrices, unlike rbind on data frames, does not require the column names to match; you then only need to change the names once at the end.
Since the functions you use in your lapply calls return scalars, it would be easier to use sapply. sapply returns vectors, which you can rbind:
datuniqueNDC <- sapply(datlist, function(x) length(unique(x$NDC)))
datuniquePID <- sapply(datlist, function(x) length(unique(x$PAYERID)))
datlengthNDC <- sapply(datlist, function(x) length(x$NDC))
dat <- as.data.frame(rbind(datuniqueNDC,datuniquePID,datlengthNDC))
names(dat) <- c("x", "y", "z")
Another solution is to calculate all three of your statistics in one function:
dat <- as.data.frame(sapply(datlist, function(x) {
c(length(unique(x$NDC)), length(unique(x$PAYERID)), length(x$NDC))
}))
names(dat) <- c("x", "y", "z")
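To see the one-function approach run end to end, here is a small sketch with a made-up stand-in for datlist (the real data is not shown in the question). It uses vapply instead of sapply so that the shape of each result is checked; the row labels uniqueNDC, uniquePID and lengthNDC are chosen here for illustration.

```r
# Made-up stand-in for datlist: three small data frames
datlist <- list(
  X = data.frame(NDC = c(1, 1, 2), PAYERID = c("a", "b", "b")),
  Y = data.frame(NDC = c(3, 4, 5, 5), PAYERID = c("a", "a", "a", "a")),
  Z = data.frame(NDC = c(6, 6), PAYERID = c("c", "d"))
)

# vapply checks that every element yields exactly three integers
counts <- vapply(datlist, function(x) {
  c(uniqueNDC = length(unique(x$NDC)),
    uniquePID = length(unique(x$PAYERID)),
    lengthNDC = length(x$NDC))
}, integer(3))

# Rows are the three statistics, columns are the elements of datlist
dat <- as.data.frame(counts)
```

Naming the elements of the returned vector gives the rows meaningful labels, so no renaming step is needed afterwards.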