Modification of lists with respect to dates - R

I have a list of data:
list <- list()
list$date <- structure(19297:19310, class = "Date")
list$value <- c(100,200,300,100,200,300,100,200,300,100,200,500,800)
list$temp2 <- c(1000,2000,3000,1000,2000,3000,1000,2000,3000,1000,2000,5888,9887)
I want to modify the list in such a way that:
Every element of list$value is multiplied by 0.5 * list$temp2 (a plain element-wise multiplication).
Except for the maximum value that falls within days 1 to 7 of the dates (the maximum of the first week): this value needs to be doubled instead. In other words, exactly one element of list$value is not replaced by step 1 but is multiplied by 2.
Can anyone help me with this?

Converting the list into a data.frame (or better yet a data.table) will enable column-wise operations on the data.
dl <- list()
dl$date <- structure(19297:19310, class = "Date")
dl$value <- c(100,200,300,100,200,300,100,200,300,100,200,500,800)
dl$temp2 <- c(1000,2000,3000,1000,2000,3000,1000,2000,3000,1000,2000,5888,9887)
The list elements have unequal lengths (14 dates but only 13 values), so append an NA to value and temp2 so that every element has 14 entries:
dl$value[[length(dl$value)+1]] <- NA
dl$temp2[[length(dl$temp2)+1]] <- NA
Convert to a data.frame
df <- as.data.frame(dl)
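Create the exception flag: the row holding the maximum value within the first 7 days. which.max returns the first such row if the maximum is tied, so exactly one row is flagged and rows after day 7 with the same value are left alone.
df$exception <- seq_len(nrow(df)) == which.max(df$value[1:7])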
Create a new variable "value2": multiply by 0.5 * temp2 where the exception doesn't occur, and by 2 where it does.
df$value2 <- NA_real_
df$value2[!df$exception] <- df$value[!df$exception] * 0.5 * df$temp2[!df$exception]
df$value2[df$exception] <- df$value[df$exception] * 2
The output, which can be passed back into a list object, if required
df$date
df$value2
df$temp2
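The same steps can also be sketched without leaving the list. This is only a sketch under stated assumptions: step 1 is read as value * 0.5 * temp2 (as the question words it), the first week is taken as the first seven calendar days after the earliest date, a tied maximum is resolved in favour of its first occurrence, and the helper names in_week1, peak and is_peak are made up for illustration.

```r
dl <- list(
  date  = structure(19297:19310, class = "Date"),
  # value and temp2 are padded with NA so that all elements have 14 entries
  value = c(100, 200, 300, 100, 200, 300, 100, 200, 300, 100, 200, 500, 800, NA),
  temp2 = c(1000, 2000, 3000, 1000, 2000, 3000, 1000, 2000, 3000, 1000, 2000, 5888, 9887, NA)
)

# TRUE for the first seven calendar days (assumption: week 1 starts at the earliest date)
in_week1 <- dl$date < min(dl$date) + 7

# Position of the first-week maximum; which.max picks the first one if there are ties
peak <- which(in_week1)[which.max(dl$value[in_week1])]
is_peak <- seq_along(dl$value) == peak

# Double the flagged element, apply value * 0.5 * temp2 everywhere else
dl$value2 <- ifelse(is_peak, dl$value * 2, dl$value * 0.5 * dl$temp2)
```

Because the result is stored as another list element, nothing has to be converted back afterwards.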

Related

How to iteratively remove columns in r?

Let's take an example data frame from which a variable set of columns should be removed:
frame <- data.frame("a" = 1:5, "b" = 2:6, "c" = 3:7, "d" = 4:8)
rem <- readline()
frame <- subset(frame, select = -c(rem))
How do I get the variable column to be removed? This is not my real code, just wanted to present my problem in a simple code. Thanks!
Edit: I am so sorry, I am really sleepy and don't know what I typed into my code, I edited it now.
1) Do both at once. We assume that ix contains at least one column number.
ix <- 1:2
frame[-ix]
## c d
## 1 3 4
## 2 4 5
## 3 5 6
## 4 6 7
## 5 7 8
1a) Or, if the case where ix has zero length (ix <- c()) matters, we can do this instead, since frame[-ix] would not return the original data frame in that case. The output of this and of all the remaining alternatives is the same as for (1), so we won't repeat it.
ix <- 1:2
frame[setdiff(seq_along(frame), ix)]
1b) or if we have names rather than column numbers. This works even if nms is a zero length vector in which case it returns the original data frame.
nms <- c("a", "b")
frame[setdiff(names(frame), nms)]
2) Or, if you need to do it iteratively, remove the column with the largest index first. If the columns were removed in ascending order, then after the first one is removed the second column would no longer be the second but the first. If we knew that ix was already sorted we could omit the sort. We use frame_out to hold the result so that the input is not destroyed. This works even if ix is the empty vector.
ix <- 1:2
frame_out <- frame
for(i in rev(sort(ix))) frame_out <- frame_out[-i]
frame_out
3) One way to do it independent of order is to do it by name. In this case it is possible to remove the columns in ascending order. This works even if ix is the empty vector.
ix <- 1:2
nms <- names(frame)[ix]
frame_out <- frame
for(nm in nms) frame_out <- frame_out[-match(nm, names(frame_out))]
frame_out
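Two points from the answers above can be checked on the example frame: why positional removal has to start from the highest index, and how a name typed by the user (the question's rem) could drive the by-name variant. This is a rough sketch; rem is hard-coded as a stand-in for readline() so that it runs non-interactively, and the comma-separated input format is an assumption.

```r
frame <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8)

# Positional removal: ascending order drops the wrong column,
# because the remaining indices shift after each removal.
ix <- 1:2
ascending <- frame
for (i in sort(ix)) ascending <- ascending[-i]
descending <- frame
for (i in sort(ix, decreasing = TRUE)) descending <- descending[-i]
names(ascending)   # "b" "d"  (column "c" was removed instead of "b")
names(descending)  # "c" "d"

# Removal by names typed by the user.
# `rem` stands in for readline(); the "b, c" format is an assumption.
rem <- "b, c"
nms <- trimws(strsplit(rem, ",", fixed = TRUE)[[1]])
by_name <- frame[setdiff(names(frame), nms)]
names(by_name)     # "a" "d"
```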

Introducing missing values using number of IDs in R randomly

I have a dataset with many facilities, each with a unique Facility ID, and variables clustered at the Facility ID. I would like to select a number of IDs at random and then introduce missing values for a given number of reported values within each selected facility.
Below is a sample of the dataset.
h <- data.frame(cbind(FacilityID = rep(1:5,each=12),X1=rnorm(60,0,1)))
The data has 5 FacilityIDs with 12 values reported for each ID for a variable X1.
I would like to perform the following:
For 2 IDs selected at random, 3 values within each ID are set to missing at random.
For 1 ID selected at random, 4 values within that ID are set to missing at random.
Here is a tidyverse solution.
Use sample to draw the 3 IDs. Within each group, sample(row_number()) is a random permutation of the row numbers, so sample(row_number()) <= 4 is TRUE for exactly 4 randomly chosen rows.
library(tidyverse)
ids <- sample(unique(h$FacilityID), 3)
h %>%
  group_by(FacilityID) %>%
  mutate(
    X1 = case_when(
      FacilityID %in% ids[1:2] & sample(row_number()) <= 3 ~ NA_real_,
      FacilityID %in% ids[3] & sample(row_number()) <= 4 ~ NA_real_,
      TRUE ~ X1
    )
  )
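The row-selection trick can be tried on its own in base R. In this small sketch seq_len(n_rows) plays the role of row_number() inside one group; n_rows and picked are illustrative names, and the seed is arbitrary.

```r
set.seed(1)    # arbitrary seed, only for reproducibility
n_rows <- 12   # one facility has 12 rows in the example data

# A random permutation of 1..12 contains each of 1, 2, 3, 4 exactly once,
# so the comparison marks exactly 4 positions, whatever the seed.
picked <- sample(seq_len(n_rows)) <= 4
sum(picked)  # 4
```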
It's not clear whether you want these two operations to be performed together or individually.
Individually you could do:
# Set 3 values from 2 IDs to NA
for(i in sample(unique(h$FacilityID), 2)) {
  h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
# Set 4 values from 1 ID to NA:
h$X1[sample(which(h$FacilityID == sample(unique(h$FacilityID), 1)), 4)] <- NA
If you want to perform both operations at once on the same data set you can do:
IDs <- sample(unique(h$FacilityID), 3)
for(i in IDs) {
  if(i == IDs[3])
    h$X1[sample(which(h$FacilityID == i), 4)] <- NA
  else
    h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
Base R:
# Set seed for reproducibility:
set.seed(2020)
# Store no_nas, the number of NAs to introduce per facility: no_nas => numeric vector
no_nas <- c(rep(3, 2), 4)
# Store n, the number of facilities to sample: n => integer scalar
n <- length(no_nas)
# Subset the data.frame to the records of the randomly sampled
# FacilityIDs and assign the NA values: facidsample => data.frame
facidsample <- do.call(rbind, Map(function(x, y) {
  i <- h[h$FacilityID == x, ]
  i$X1[sample(seq_len(nrow(i)), y)] <- NA_real_
  i
}, sample(unique(h$FacilityID), n), no_nas))
# Combine the sampled data with the untouched rest of the original set: j => data.frame
j <- rbind(h[h$FacilityID %in% setdiff(h$FacilityID, facidsample$FacilityID), ],
           facidsample)
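The approaches above can be wrapped into one small helper that keeps the original row order. This is a sketch, not a definitive implementation: add_nas and its argument n_missing are hypothetical names, and it assumes the column names FacilityID and X1 from the sample data. It indexes via sample.int, which sidesteps the special behaviour of sample() when it is handed a single number.

```r
set.seed(2020)  # arbitrary seed, only for reproducibility
h <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60, 0, 1))

# Hypothetical helper: n_missing[k] values are set to NA in the k-th sampled facility
add_nas <- function(data, n_missing) {
  all_ids <- unique(data$FacilityID)
  ids <- all_ids[sample.int(length(all_ids), length(n_missing))]
  for (k in seq_along(ids)) {
    rows <- which(data$FacilityID == ids[k])
    data$X1[rows[sample.int(length(rows), n_missing[k])]] <- NA
  }
  data
}

h_na <- add_nas(h, c(3, 3, 4))

# Number of missing values per facility
na_counts <- tapply(is.na(h_na$X1), h_na$FacilityID, sum)
```

Two facilities end up with 3 missing values, one with 4, and the remaining two stay complete; which facility gets which count depends on the seed.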

Substituting or summing based on condition

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"s belong to the same "Group", the later one (by Year) substitutes the earlier one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" at time 1972 replaced the "Val" at time 1970. Similarly for Beta.
Thank you in advance for your help.
In my head, this requires a for loop. First we split the data frame by the id column into a list of two. Then we create two empty lists. In the og list, we will put the row where the first non-NA group identifier occurs. For Alpha this is the first row and for Beta this is the second row. We will use it to subtract from the cumulative sum when the value gets substituted.
mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We use a nested loop: the outer loop runs over each object in the list (a data frame in this case), and the inner loop runs over each value in the Group column.
We keep track of the row number with the variable r. It is set to 0 before the inner loop and incremented by 1 at the start of each iteration. First we check whether we are in the first row of the data frame, in which case the cumulative sum is simply the value in the first row of the Val column. Within that if test, a second if test checks whether the Group id is NA. If it isn't, this is the first occurrence of a group number, and a later row with the same number will substitute the current value. So we save the number to the temporary variable temp and store the row that contains the value in the og list.
Then we move on to the next iteration. We check whether the current Group value is NA. If it is, we just add the value to the cumulative sum. If it is not NA, we check whether it is equal to the value stored in temp. If so, we need to substitute: we extract the original value stored in the og list and save it as old, then subtract old from the cumulative sum and add the current value. We also replace the original value in og with the current replacement value, because if the value needs to be replaced again, we will need to subtract the current value and not the original one.
If j is not NA but is not equal to temp, then this is a new Group. So we save the row with the original value to the og list and save the Group in temp. The sum continues as normal, as this is not an instance of replacing a value. Note that the variable x, which counts the elements in the og list, is only incremented when a new occurrence is added to the list. Thus, og[[x-1]] always holds the most recently stored row for the current group.
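library(dplyr)  # provides %>% and bind_rows()
for (my_df in mylist) {
  x <- 1
  r <- 0
  vals <- list()  # reset so sums from the previous data frame do not carry over
  temp <- NA      # no group seen yet in this data frame
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      # First row: the cumulative sum is just the first value
      vals[[1]] <- my_df$Val[1]
      if (!is.na(j)) {
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j)) {
      # No group: plain cumulative sum
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (!is.na(temp) && j == temp) {
      # Same group as before: swap the old value for the current one
      old <- og[[x-1]]
      old <- old[,2]
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      # New group: add the value and remember the row
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}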
Lastly, we combine the two data frames in the list by binding them on rows with bind_rows from the dplyr package. Then I check with identical() whether the final data frame matches your desired output, and it evaluates to TRUE.
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
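For comparison, the substitution rule can also be written without an explicit loop. Note that this is a different technique from the loop above, offered as a sketch: each row contributes its Val minus the previous Val of the same id and Group (or minus 0 if there is none), and the cumulative sum of these contributions per id gives cumval. It assumes the rows are already ordered by Year within each id; grp_key, prev and delta are illustrative names.

```r
df <- data.frame(
  id    = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year  = c(1970, 1971, 1972, 1973, 1974, 1977, 1978, 1990),
  Group = c(1, NA, 1, NA, NA, 2, 2, NA),
  Val   = c(2, 3, 3, 5, 2, 5, 3, 5)
)

# Rows with a Group share a key per id and Group; rows without one get a unique key
grp_key <- ifelse(is.na(df$Group),
                  paste0("no_group_row_", seq_len(nrow(df))),
                  paste(df$id, df$Group))

# Previous Val within the same key (0 for the first occurrence)
prev <- ave(df$Val, grp_key, FUN = function(v) c(0, head(v, -1)))

# Net contribution of each row, then the running total per id
delta <- df$Val - prev
df$cumval <- ave(delta, df$id, FUN = cumsum)
df$cumval  # 2 5 6 11 2 7 5 10
```

Unlike the loop, this version also substitutes when a group reappears after a different group has occurred in between, which may or may not be what is wanted.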

Efficiently replacing a variable number of NA values based on logical vector

I am attempting to replace NA values in my data frame based on the logical return of one of the columns in the data frame.
#Creating random example data frame
a <- rbinom(1000,1,.5)
b <- rbinom(1000,1,.75)
c <- rbinom(1000,1,.25)
d <- rbinom(1000,1,.5)
e <- rbinom(1000,1,.5) # Will be the logical column
df <- cbind(a,b,c,d)
for(i in 1:1000){
  if(sum(df[i,1:4]) >2){
    df[i,1:4] <- NA
  }
}
# randomly replacing some of the NA to represent the observation data
df[sample(1:length(df), 100, replace=F)] <- 1
df <- cbind(df, e)
I am attempting to fill in the NAs with 0 when e == 1 while still retaining the random 1s I placed in the other 4 columns (especially those where the rest of the values are NA).
I've tried creating loops like:
for(i in 1:nrow(df)){
  if(df[,'e']==1){
    df[i,is.na(df[i,1:4])] <- 0
  }
}
However, that clears both my logical column and my observation data.
The data frame that I want to apply this to is large (2.8 million rows X 23 col) containing metadata and observation data so something that takes speed into account would be great.
We can do this with data.table
library(data.table)
df1 <- as.data.frame(df)
setDT(df1)
for(j in 1:4){
  set(df1, i = which(df1[['e']]==1 & is.na(df1[[j]])), j = j, value = 0)
}
This should be more efficient because we are using set: according to its help page (?set), calling set avoids the overhead of [.data.table.
As @thelatemail mentioned, a compact base R option would be
df[,1:4][df[,"e"]==1 & is.na(df[,1:4])] <- 0
If the matrix is very big, the logical matrix would be big as well and that could potentially create memory-related issues.
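One way to avoid building the full logical matrix in base R is to work one column at a time, so that only a logical vector of length nrow exists at any moment. The following is a minimal sketch on a tiny made-up matrix; m, obs_cols and e_is_one are illustrative names, and no timing claim is implied.

```r
# Tiny stand-in for the real data: two observation columns plus the logical column e
m <- cbind(a = c(NA, 1, NA, NA),
           b = c(NA, NA, 1, NA),
           e = c(1, 1, 0, 0))
obs_cols <- c("a", "b")

# Rows where the logical column is 1 (computed once)
e_is_one <- m[, "e"] == 1

# Fill NAs with 0 column by column; existing 1s and rows with e == 0 stay untouched
for (col in obs_cols) {
  idx <- e_is_one & is.na(m[, col])
  m[idx, col] <- 0
}
m
```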

How to convert single column data into two-column matrix using conditional/for loop in R

I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
  if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
    z <- z + 1
    y[z,1] <- paste(df[i])
  } else{
    y[z,2] <- paste(df[i], collapse = "")
  }
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would normally stick with packages for this, here is a solution in base R.
Initialize the data:
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
Process:
ind <- grep("^>", mydf$x)  # rows that start with ">" are headers
temp <- data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs <- rep(NA, length(ind))
for(i in seq_along(ind)) {
  # concatenate all sequence lines that belong to header i
  seqs[i] <- paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable <- data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
                              name                     sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2         PROKKA_00003 lipoprotein  MTESSITERGAPELMRTIIVIASLLLT
Try creating a logical index of the header rows (those containing the target symbol), then split the data on that index. cumsum(ind1) coerces the logical vector to numeric and produces a running header count, so mydf$x[ind1][cumsum(ind1)] repeats each header for every row of its section; the final [!ind1,] then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
#            Name               Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002      MTESSITERGAPEL
# 5 >PROKKA_00003         lipoprotein
# 6 >PROKKA_00003       MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
                      "MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
                      "QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
                      ">PROKKA_00003 lipoprotein",
                      "MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
  data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view t, I believe this is what you were looking for in your original post.
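The section-number idea also works in base R alone. A minimal sketch, assuming the lines are available as a character vector and that every header is followed by at least one sequence line; the shortened sequences and the names fasta_lines, is_header, section and fasta are made up for illustration.

```r
# Shortened stand-in for the single-column data
fasta_lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
                 "MTESSITERGAPEL",
                 "QLLQTAGVFAAG",
                 ">PROKKA_00003 lipoprotein",
                 "MRTIIVIASLLLT")

# Header lines start with ">"; the running count of headers numbers the sections
is_header <- startsWith(fasta_lines, ">")
section <- cumsum(is_header)

# Concatenate the sequence lines of each section into one string
seqs <- tapply(fasta_lines[!is_header], section[!is_header], paste, collapse = "")

fasta <- data.frame(name = sub("^>", "", fasta_lines[is_header]),
                    sequence = as.vector(seqs),
                    stringsAsFactors = FALSE)
fasta
```

If a header could appear with no sequence lines after it, the two columns would have different lengths and data.frame would fail, so that case needs separate handling.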

Resources