Using regex to set a specific digit to NA? [duplicate] - r

Sample of df:
LASSO_deviance LASSO_AUC
68 0.999 0.999
2 1.000 1.000
39 1.000 1.005
7 1.02 1.2
I want to set cells that contain 1.000 to NA or, if that is not possible, to 0.
I've tried something like: df %>% mutate_at(vars(LASSO_deviance, LASSO_AUC), funs(gsub(pattern = "1{1}[^.{1,}]", 0, x = .))) with no luck.

tt <- "LASSO_deviance LASSO_AUC
68 0.999 0.999
2 1.000 1.000
39 1.000 1.005
7 1.02 1.2"
dat <- read.table(text = tt, header = TRUE)
There is no need for regex: you can simply find where the data equals 1.000 and assign to those cells.
dat[dat == 1.000] <- NA # or dat[dat == 1.000] <- 0
dat
# LASSO_deviance LASSO_AUC
# 68 0.999 0.999
# 2 NA NA
# 39 NA 1.005
# 7 1.020 1.200
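If the values come from a computation rather than from parsed text, an exact == comparison can miss numbers that only print as 1.000. The following is a minimal sketch of a tolerance-based variant; the tolerance tol is an assumption and should be chosen to suit the data.

```r
# Rebuild the sample data from the question.
dat <- data.frame(
  LASSO_deviance = c(0.999, 1.000, 1.000, 1.02),
  LASSO_AUC      = c(0.999, 1.000, 1.005, 1.2),
  row.names      = c("68", "2", "39", "7")
)

tol <- 1e-8  # assumed tolerance, not part of the original answer

# Logical matrix marking cells within tol of 1.
is_one <- abs(as.matrix(dat) - 1) < tol

# Assign NA (or 0) to the marked cells.
dat[is_one] <- NA
dat
```

Only cells within tol of 1 are replaced, so 1.005 and 1.02 are left untouched.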

Related

R - Help to build time column from 0?

I need to create a 'time' column, starting at 0 and increasing in increments of 0.0005. The length of the column should depend on the length of the existing columns. What I have tried so far is below.
So in my head, the below script says: create a column with 0 and 0.0005 as data points 1 and 2, cumulatively add the difference between data points 1 and 2 and repeat for length of specified column. This doesn't really work, hence why I am posting here. If anyone has some sage advice, it would be greatly appreciated.
df$time = c(0,0.0005, cumsum(diff(df$time [1:2], lag = 1)), length(df$other.column))
Expected outcome
time
0
0.0005
0.001
0.0015
0.002
0.0025
0.003
0.0035
0.004
0.0045
0.005
0.0055
0.006
0.0065
0.007
0.0075
0.008
0.0085
0.009
0.0095
etc
We can multiply 0.0005 by the zero-based row index:
df$time <- (seq_len(nrow(df)) - 1) * 0.0005
data
df <- data.frame(a = 1:10)
It sounds like you just want the following sequence:
seq(0, 0.1, by=0.0005)
You may replace the from and to values with whatever you want via:
seq(from, to, by=0.0005)
You could use seq and set its length.out argument to the number of rows of the data frame.
df <- data.frame(a = 1:10)
df$time <- seq(0, by = 0.0005, length.out = nrow(df))
df
# a time
#1 1 0.0000
#2 2 0.0005
#3 3 0.0010
#4 4 0.0015
#5 5 0.0020
#6 6 0.0025
#7 7 0.0030
#8 8 0.0035
#9 9 0.0040
#10 10 0.0045
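As a quick check, the index-based and seq-based approaches above can be compared directly. This is a small sketch; the names step, time_by_index and time_by_seq are only illustrative.

```r
df <- data.frame(a = 1:10)
step <- 0.0005

# Approach 1: zero-based row index times the step.
time_by_index <- (seq_len(nrow(df)) - 1) * step

# Approach 2: seq() with a fixed step and a length taken from the data frame.
time_by_seq <- seq(0, by = step, length.out = nrow(df))

# Both should give the same column.
all.equal(time_by_index, time_by_seq)

df$time <- time_by_seq
```

Both approaches compute each value from the start rather than accumulating additions, which avoids the rounding drift that a cumulative sum can build up over long columns.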

How to calculate the Bonferroni Lower and Upper limits in R?

With the following data, I am trying to calculate the Chi Square and Bonferroni lower and upper Confidence intervals. The column "Data_No" identifies the dataset (as calculations needs to be done separately for each dataset).
Data_No Area Observed
1 3353 31
1 2297 2
1 1590 15
1 1087 16
1 817 2
1 847 10
1 1014 28
1 872 29
1 1026 29
1 1215 21
2 3353 31
2 2297 2
2 1590 15
3 1087 16
3 817 2
The code I used is
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU = Observed/sum(Observed), Alpha = 0.05/2*count(Data_No),
           Zvalue = qnorm(Alpha,lower.tail=FALSE), lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)), upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data,data)
}
write.csv(result_data,file='final_result.csv')
And the error message I get is:
Error in UseMethod("summarise_") : no applicable method for
'summarise_' applied to an object of class "c('integer', 'numeric')"
The column that I am calling "Alpha" is the alpha value 0.05/(2k), where k is the number of categories. In my example, the first dataset has 10 categories (rows with Data_No equal to 1), so "Alpha" needs to be 0.05/20 = 0.0025, and its corresponding Z value is 2.807. The second dataset has 3 categories (so 0.05/6) and the third has 2 categories (0.05/4) in my example table ("Data_No" column). Using the values from the newly calculated "Alpha" column, I then need to calculate the Zvalue column (Zvalue = qnorm(Alpha,lower.tail=FALSE)), which I then use to calculate the lower and upper confidence limits.
From the above data, here are the results that I should get, but note that I have had to manually calculate Alpha column and Zvalue, rather than insert those calculations within the R code:
Data_No Area Observed RelativeArea Alpha Z value lower upper
1 3353 31 0.237 0.003 2.807 0.092 0.247
1 2297 2 0.163 0.003 2.807 -0.011 0.033
1 1590 15 0.113 0.003 2.807 0.025 0.139
1 1087 16 0.077 0.003 2.807 0.029 0.146
1 817 2 0.058 0.003 2.807 -0.011 0.033
1 847 10 0.060 0.003 2.807 0.007 0.102
1 1014 28 0.072 0.003 2.807 0.078 0.228
1 872 29 0.062 0.003 2.807 0.083 0.234
1 1026 29 0.073 0.003 2.807 0.083 0.234
1 1215 21 0.086 0.003 2.807 0.049 0.181
2 3353 31 0.463 0.008 2.394 0.481 0.811
2 2297 2 0.317 0.008 2.394 -0.027 0.111
2 1590 15 0.220 0.008 2.394 0.152 0.473
3 1087 16 0.571 0.013 2.241 0.723 1.055
3 817 2 0.429 0.013 2.241 -0.055 0.277
Please note that I only included some of the columns generated from the code.
# Check the closing bracket of the sqrt() call in lower. The following code should work.
library(dplyr)
data <- read.csv("test.csv")
data <- data %>%
  mutate(RelativeArea = Area/sum(Area),
         Expected = RelativeArea*sum(Observed),
         OminusE = Observed-Expected,
         O2 = OminusE^2,
         O2divE = O2/Expected,
         APU = Observed/sum(Observed),
         lower = APU-2.394*sqrt(APU*(1-APU)/sum(Observed)),
         upper = APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))
#Answer to follow-up question.
#Sample Data
Data_No Area Observed
1 3353 31
1 2297 2
2 1590 15
2 1087 16
#Code to run
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
data <- total_data[which(total_data$Data_No == i),]
data <- data %>% mutate(RelativeArea =
Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE =
Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU =
Observed/sum(Observed), lower =
APU-2.394*sqrt(APU*(1-APU)/sum(Observed)), upper =
APU+2.394*sqrt(APU*(1-APU)/sum(Observed)))
result_data <- rbind(result_data,data)
}
write.csv(result_data,file='final_result.csv')
#Issue in calculating Alpha. I have updated the code.
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
#Creating the NO_OF_CATEGORIES column based on your question.
total_data$NO_OF_CATEGORIES <- 0
total_data[which(total_data$Data_No==1),]$NO_OF_CATEGORIES <- 10
total_data[which(total_data$Data_No==2),]$NO_OF_CATEGORIES <- 3
total_data[which(total_data$Data_No==3),]$NO_OF_CATEGORIES <- 2
#Actual code
result_data <- NULL
for(i in unique(total_data$Data_No)){
data <- total_data[which(total_data$Data_No == i),]
data <- data %>%
mutate(RelativeArea = Area/sum(Area), Expected = RelativeArea*sum(Observed), OminusE = Observed-Expected, O2 = OminusE^2, O2divE = O2/Expected, APU = Observed/sum(Observed), Alpha = 0.05/(2*(unique(data$NO_OF_CATEGORIES))),
Zvalue = qnorm(Alpha,lower.tail=FALSE), lower = APU-Zvalue*sqrt(APU*(1-APU)/sum(Observed)), upper = APU+Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
result_data <- rbind(result_data,data) }
write.csv(result_data,file='final_result.csv')
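The code above hard-codes the number of categories per dataset. As a hedged alternative, the sketch below derives k from the number of rows in each Data_No group, assuming each row is one category as in the question's table. It uses base R instead of dplyr, and the names bonferroni_limits and conf_alpha are hypothetical, not from the original answer.

```r
# Sample data from the question.
total_data <- data.frame(
  Data_No  = c(rep(1, 10), rep(2, 3), rep(3, 2)),
  Area     = c(3353, 2297, 1590, 1087, 817, 847, 1014, 872, 1026, 1215,
               3353, 2297, 1590, 1087, 817),
  Observed = c(31, 2, 15, 16, 2, 10, 28, 29, 29, 21, 31, 2, 15, 16, 2)
)

# Hypothetical helper: Bonferroni limits for one dataset.
bonferroni_limits <- function(d, conf_alpha = 0.05) {
  k <- nrow(d)                                    # categories = rows (assumption)
  d$APU    <- d$Observed / sum(d$Observed)        # observed proportion of use
  d$Alpha  <- conf_alpha / (2 * k)                # Bonferroni-adjusted alpha
  d$Zvalue <- qnorm(d$Alpha, lower.tail = FALSE)  # upper-tail z value
  se       <- sqrt(d$APU * (1 - d$APU) / sum(d$Observed))
  d$lower  <- d$APU - d$Zvalue * se
  d$upper  <- d$APU + d$Zvalue * se
  d
}

# Apply the helper to each dataset and stack the results.
result_data <- do.call(
  rbind,
  lapply(split(total_data, total_data$Data_No), bonferroni_limits)
)
```

With this data the Z values come out as roughly 2.807, 2.394 and 2.241 for the three datasets, matching the manually calculated values in the question.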

Create and plot a table which preserves the ordering of the factor

When creating and plotting a table the names are numeric values and I would like for them to stay in numeric order.
Code :
library(plyr)
set.seed(1234)
# create a random vector of different categories
number_of_categories <- 11
probability_of_each_category <- c(0.1,0.05, 0.05,0.08, 0.01,
0.1, 0.2, 0.3, 0.01, 0.02,0.08)
number_of_samples <- 1000
x <- sample( LETTERS[1:number_of_categories],
number_of_samples,
replace=TRUE,
prob=probability_of_each_category)
# just a vector of zeros and ones
outcome <- rbinom(number_of_samples, 1, 0.4)
# I want x to be 1,2,...,11 so that it demonstrates the issue when
# creating the table
x <- mapvalues(x,
c(LETTERS[1:number_of_categories]),
seq(1:number_of_categories))
# the table shows the ordering
prop.table(table(x))
plot(table(x, outcome))
Table :
> prop.table(table(x))
x
1 10 11 2 3 4 5 6 7 8 9
0.105 0.023 0.078 0.044 0.069 0.083 0.018 0.097 0.195 0.281 0.007
Plot :
I would like the plot and the table in the order
1 2 3 4 5 ... 10 11
Rather than
1 10 11 2 3 4 5 6 7 8 9
You can either convert x to numeric before feeding it to table
plot(table(as.numeric(x), outcome))
Or order the table's rows by the as.numeric of the rownames
t <- table(x, outcome)
t <- t[order(as.numeric(rownames(t))),]
plot(t)
A simple way to solve this problem is to format the numbers with a leading zero during mapvalues(), using sprintf().
x <- mapvalues(x,
c(LETTERS[1:number_of_categories]),
sprintf("%02d",seq(1:number_of_categories)))
# the table shows the ordering
prop.table(table(x))
plot(table(x, outcome))
...and the output:
> prop.table(table(x))
x
01 02 03 04 05 06 07 08 09 10 11
0.104 0.067 0.038 0.073 0.019 0.112 0.191 0.291 0.011 0.019 0.075
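Another option, sketched here under the assumption that the labels should stay as "1" to "11" without zero padding, is to turn x into a factor with explicitly ordered levels. table() follows the level order rather than sorting the labels as text. The sample vector below is illustrative, not the one from the question.

```r
set.seed(1234)
number_of_categories <- 11

# Illustrative character vector of category labels "1" to "11".
x <- sample(as.character(1:number_of_categories), 1000, replace = TRUE)

# Fix the level order explicitly; table() and plot() will follow it.
x_ordered <- factor(x, levels = as.character(1:number_of_categories))

names(table(x))          # labels sorted as text
names(table(x_ordered))  # labels in numeric order
prop.table(table(x_ordered))
```

The same factor can be passed to table(x_ordered, outcome) before plotting, so the plot uses the same ordering.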

R Populate a vector by matching names to df column values

I have a named vector filled with zeros
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
I want to populate the vector with count data from a dataframe
size count
37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024
I need help finding a way to match the value for size to the vector name and then input the corresponding count value into that vector position
Might be as simple as:
toy1[ as.character(dat$size) ] <- dat$count
toy1
# 37 38 39 40 41 42 43 44 45
#1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
R's indexing for assignment accepts character values, which are matched against the names of the vector. If you had just tried to index with the raw column:
toy1[ dat$size ] <- dat$count
You would have gotten (as did I initially):
> toy1
37 38 39 40 41 42 43 44 45
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1.181 0.421
0.054 0.005 0.031 0.582 NA NA 0.024
That happened because the sizes were used as numeric positions, so the vector was extended by default to accommodate positions up to 45.
With a version of the dataframe that had a number that was not in the range 37:45, I did get a warning from using match with a nomatch of 0, but I also got the expected results:
toy1[ match( as.character( dat$size), names(toy1) , nomatch=0) ] <- dat$count
#------------
Warning message:
In toy1[match(as.character(dat$size), names(toy1), nomatch = 0)] <- dat$count :
number of items to replace is not a multiple of replacement length
> toy1
37 38 39 40 41 42 43 44 45
1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.000
The match function is at the core of the merge function, but this application should be much faster than a merge of dataframes.
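The warning above arises because sizes without a match are dropped from the index but not from the replacement values. A minimal sketch of one way to avoid that is to filter both sides with the same logical vector. The extra size 50 and its count are made up for illustration.

```r
toy1 <- rep(0, length(37:45))
names(toy1) <- 37:45

# Illustrative data frame; size 50 is hypothetical and has no slot in toy1.
dat <- data.frame(size  = c(37, 38, 45, 50),
                  count = c(1.181, 0.421, 0.024, 9.999))

# Position of each size within names(toy1); NA where there is no match.
idx  <- match(as.character(dat$size), names(toy1))
keep <- !is.na(idx)

# Use the same filter on the index and on the replacement values.
toy1[idx[keep]] <- dat$count[keep]
toy1
```

Because the index and the values have the same length after filtering, no recycling warning is raised and unmatched sizes are simply ignored.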
Let's say your data frame is df; then you can update the entries in toy1 for the sizes present in your data frame:
toy1[as.character(df$size)] <- df$count
Edit: To check for a match before updating the records, first compute m, which holds for each name of toy1 the matching position in the size column of df (NA where there is none):
m <- match(names(toy1), as.character(df$size))
Then, for the indices in toy1 which have a match, it can be updated as below:
toy1[which(!is.na(m))] <- df$count[m[!is.na(m)]]
PS: Another way would be to define toy1 as a data frame and perform an outer join on the size column.
First, let's get the data loaded in.
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
df = read.table(text="37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024")
names(df) = c("size","count")
Now, I present a really ugly solution. We only update toy1 where the name of toy1 appears in df$size. We return df$count by obtaining the index of the match in df. I use sapply to get a vector of the indices back. On both sides we only look at places where names(toy1) appear in df$size.
toy1[names(toy1) %in% df$size] = df$count[sapply(names(toy1)[names(toy1) %in% df$size],function(x){which(x == df$size)})]
But, this isn't very elegant. Instead, you could turn toy1 into a data.frame.
toydf = data.frame(toy1 = toy1,name = names(toy1),stringsAsFactors = FALSE)
Now, we can use merge to get the values.
updated = merge(toydf,df,by.x = "name",by.y="size",all.x=T)
This returns a 3 column data.frame. You can then extract the count column from this, replace NA with 0 and you're done.
updated$count[is.na(updated$count)] = 0
updated$count
#> [1] 1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024

Sum of columns of dataframe based on time period in R

I have a dataframe with multiple columns and multiple rows. The data is based on monthly observations over a period of 11 years. Now I want to take the sum of each column based on the observations for the previous 12 months. For example, the sum of a column for Jan-05 is based on its observations from Jan-04 to Dec-04, the sum for Feb-05 is based on observations from Feb-04 to Jan-05, and so on. My original data frame has data for 10 years and monthly data.
I illustrate part of my dataframe as follows:
df1
Month A B C
Jan-04 0.003 0.006 NA
Feb-04 0.003 0.002 NA
Mar-04 -0.005 -0.001 NA
Apr-04 0.000 0.000 NA
May-04 0.000 -0.002 NA
Jun-04 -0.001 -0.001 NA
Jul-04 -0.001 -0.001 NA
Aug-04 -0.010 NA NA
Sep-04 0.001 NA NA
Oct-04 0.002 NA NA
Nov-04 -0.003 NA NA
Dec-04 -0.003 NA NA
Jan-05 0.005 -0.002 NA
Feb-05 -0.0015 0.004 0.0003
Mar-05 -0.0041 0.002 0.0070
The desired resultant dataframe
Month A B C
Jan-05 -0.013 0.004 NA
Feb-05 -0.011 -0.004 NA
Mar-05 -0.0151 -0.0014 0.0003
Here is a solution in base R. First we define a function to subset the df based on the time difference from the date of interest and find the column sums on that subsetted df, and then we run that function for all of the time points of interest.
subset_last_year <- function(df, date, cols_to_sum = c("A", "B", "C")){
  date = as.POSIXct(date, format = "%d-%b-%y")
  df$Time_Difference = difftime(date, df$Month_Date, units = "weeks")
  df_last_year = df[df$Time_Difference > 0 & df$Time_Difference < 53, ]
  tmp_col_sum = colSums(df_last_year[ , cols_to_sum], na.rm = TRUE)
  return(tmp_col_sum)
}
#oddly you have to add days
df$Month_Date = paste0("01-", df$Month)
df$Month_Date = as.POSIXct(df$Month_Date, format = "%d-%b-%y")
#not worried about performance because the data set is not that large
dates = c("01-Jan-05", "01-Feb-05", "01-Mar-05")
res = data.frame()
for(i in 1:length(dates)){
  tmp = subset_last_year(df, dates[i])
  res = rbind(res, tmp)
}
rownames(res) = dates
colnames(res) = c("A", "B", "C")
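If the rows are already in chronological order with exactly one row per month and no gaps, the same idea can be expressed by row position instead of by date. The following is only a sketch under that assumption; trailing_sum is a hypothetical helper, and the data and window of 3 are chosen to keep the example short (use 12 for the previous 12 months).

```r
# Hypothetical monthly data, oldest row first, one row per month.
df1 <- data.frame(
  Month = c("Jan-04", "Feb-04", "Mar-04", "Apr-04", "May-04"),
  A     = c(1, 2, 3, 4, 5),
  B     = c(10, NA, 30, NA, 50)
)

# Sum of the `window` rows before each row, ignoring NA values.
trailing_sum <- function(x, window) {
  sapply(seq_along(x), function(i) {
    if (i <= window) return(NA_real_)  # not enough history yet
    sum(x[(i - window):(i - 1)], na.rm = TRUE)
  })
}

window <- 3  # illustrative; 12 would cover the previous 12 months
res <- data.frame(
  Month = df1$Month,
  A     = trailing_sum(df1$A, window),
  B     = trailing_sum(df1$B, window)
)
res
```

Rows without a full window of history are returned as NA, so they can be dropped afterwards if only complete periods are wanted.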
