I have a dataset with many facilities, each with a unique Facility ID, and variables clustered at the Facility ID. I would like to randomly select a number of IDs and then introduce missing values for a given number of reported values within each facility.
Below is a sample of the dataset.
h <- data.frame(FacilityID = rep(1:5, each = 12), X1 = rnorm(60, 0, 1))
The data has 5 FacilityIDs with 12 values reported for each ID for a variable X1.
I would like to perform the following:
For 2 randomly selected IDs, 3 missing values are assigned at random within each ID.
For 1 randomly selected ID, 4 missing values are assigned at random within that ID.
Here is a tidyverse solution.
Use sample to get the 3 IDs. Within each group, sample(row_number()) <= n randomly selects n rows (3 for the first two IDs, 4 for the third).
library(tidyverse)
ids <- sample(unique(h$FacilityID), 3)
h %>%
  group_by(FacilityID) %>%
  mutate(
    X1 = case_when(
      FacilityID %in% ids[1:2] & sample(row_number()) <= 3 ~ NA_real_,
      FacilityID %in% ids[3] & sample(row_number()) <= 4 ~ NA_real_,
      TRUE ~ X1
    )
  )
It's not clear whether you want these two operations to be performed together or individually.
Individually you could do:
# Set 3 values from 2 IDs to NA
for (i in sample(unique(h$FacilityID), 2)) {
  h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
# Set 4 values from 1 ID to NA:
h$X1[sample(which(h$FacilityID == sample(unique(h$FacilityID), 1)), 4)] <- NA
If you want to perform both operations at once on the same data set you can do:
IDs <- sample(unique(h$FacilityID), 3)
for (i in IDs) {
  if (i == IDs[3])
    h$X1[sample(which(h$FacilityID == i), 4)] <- NA
  else
    h$X1[sample(which(h$FacilityID == i), 3)] <- NA
}
Base R:
# Set seed for reproducibility:
set.seed(2020)
# Store no_nas, the number of nas to introduce per facility: no_nas => integer vector
no_nas <- c(rep(3, 2), 4)
# Store n, the number of facilities to sample: n => integer scalar
n <- length(no_nas)
# Subset data.frame to records containing randomly sampled
# FacilityIDs and assign NA values: facidsample => data.frame
facidsample <- do.call(rbind, Map(function(x, y) {
  i <- h[h$FacilityID == x, ]
  i$X1[sample(seq_len(nrow(i)), y)] <- NA_real_
  i
}, sample(unique(h$FacilityID), n), no_nas))
# Combine sampled data with original set less nullified entries: j => data.frame
j <- rbind(h[h$FacilityID %in% setdiff(h$FacilityID, facidsample$FacilityID),],
facidsample)
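Whichever variant you use, a quick sanity check is to count the introduced NAs per facility. This is an illustrative check using the j produced by the Map approach above, not part of the original solutions:
tapply(is.na(j$X1), j$FacilityID, sum)
# the three sampled facilities should show 3, 3 and 4; all others 0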
I have a list of data:
list <- list()
list$date <- structure(19297:19310, class = "Date")
list$value <- c(100,200,300,100,200,300,100,200,300,100,200,500,800)
list$temp2 <- c(1000,2000,3000,1000,2000,3000,1000,2000,3000,1000,2000,5888,9887)
I want to modify the list in such a way that:
every element of list$value is multiplied by 0.5 * list$temp2 (a straightforward multiplication),
except the maximum of the values falling within days 1 to 7 of the date (the maximum of the first week); this maximum needs to be doubled instead (i.e., exactly one element of list$value is not replaced by step 1 but is doubled).
Can anyone help me with this?
Converting the list into a data.frame (or better yet a data.table) will enable column-wise operations on the data.
dl <- list()
dl$date <- structure(19297:19310, class = "Date")
dl$value <- c(100,200,300,100,200,300,100,200,300,100,200,500,800)
dl$temp2 <- c(1000,2000,3000,1000,2000,3000,1000,2000,3000,1000,2000,5888,9887)
Since the list elements are of unequal length, append an NA to the end of value and temp2 so that all three elements have 14 entries:
dl$value[[length(dl$value)+1]] <- NA
dl$temp2[[length(dl$temp2)+1]] <- NA
Convert to a data.frame
df <- as.data.frame(dl)
Create the exception criterion: the maximum value within the first 7 days. Using which.max flags exactly one row (the first occurrence of that maximum), which matches the requirement that only a single value is doubled.
df$exception <- FALSE
df$exception[which.max(df$value[1:7])] <- TRUE
Create a new variable value2 and perform the multiplications: 0.5 * temp2 where the exception doesn't occur (as the question specifies), and x2 where it does occur.
df$value2 <- NA_real_
df$value2[!df$exception] <- df$value[!df$exception] * 0.5 * df$temp2[!df$exception]
df$value2[df$exception] <- df$value[df$exception] * 2
The output, which can be passed back into a list object if required:
df$date
df$value2
df$temp2
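For reference, the same assignment can be written in one line with ifelse, equivalent to the two indexed assignments above:
df$value2 <- ifelse(df$exception, df$value * 2, df$value * 0.5 * df$temp2)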
Given a 3 x 100 matrix, how could I create a vector of strings containing individual column names? Specifically, columns comprise 20 sets of 5 consecutive measures and therefore strings should match variable (i.e. varA, ... varC), sets (SET1 to SET20) and order (1 to 5). For example:
my_matrix = replicate(100, rnorm(3))
my_names <- c("varA.SET1.1", "varA.SET1.2", "varA.SET1.3", "varA.SET1.4", "varA.SET1.5",
"varA.SET2.1", "varA.SET2.2", "varA.SET2.3", "varA.SET2.4", "varA.SET2.5",
...
"varC.SET5.5")
You can use expand.grid with sprintf. Note that expand.grid varies its first argument fastest, so put the order index first to reproduce the requested ordering (order within set within variable):
v <- LETTERS[1:3]
set <- 1:20
ord <- 1:5
ex <- expand.grid(ord = ord, set = set, v = v)
my_names <- sprintf("var%s.SET%i.%i", ex$v, ex$set, ex$ord)
head(my_names)
#[1] "varA.SET1.1" "varA.SET1.2" "varA.SET1.3" "varA.SET1.4" "varA.SET1.5"
#[6] "varA.SET2.1"
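The same vector can be built with rep and paste0; with the expand.grid ordered as above, the two constructions should agree (a cross-check, not part of the original answer):
my_names2 <- paste0("var", rep(LETTERS[1:3], each = 100),
                    ".SET", rep(rep(1:20, each = 5), times = 3),
                    ".", rep(1:5, times = 60))
identical(my_names, my_names2)  # expected TRUE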
Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows to 251 columns) but I have given the code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to step through the entire data frame row by row starting at the top, and when the value of the 'Value' column changes from X to Y or Y to X, I need to extract that row and the next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y; since the value has changed from X to Y, I need to extract the entire rows 2 and 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I am using a 'for' loop to step from row to row and extract the rows when the values don't match, but it is very slow and inefficient. The code snippet is below:
for (i in 1:(nrow(df) - 1)) {
  if (df[[i + 1, 2]] != df[i, 2]) {
    dfextract <- rbind(dfextract, df[i, ])
    dfextract <- rbind(dfextract, df[i + 1, ])
  }
}
I am looking for a better and faster solution, perhaps one using functions from the 'apply()' family, or 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of column Name.
diffstr <- function(x) x[-1] == x[-length(x)]
res <- lapply(split(df, df$Name), function(x) {
  inx <- which(c(FALSE, !diffstr(x$Value)))
  do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
How it works.
First, I define a helper function diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; a negative index removes that element from the vector. The same goes for x[-length(x)], where the negative index removes the last element of x.
split(df, df$Name) splits the data frame into subsets, one for each value of Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. A FALSE is prepended to the result because the first value has no predecessor and therefore cannot be a change point.
The next line is a tricky one. lapply over the index of change points inx extracts, for each one, a two-row segment of the data frame x: the row before the change and the changed row. do.call is then used to rbind those two-row data frames back together.
Now res is a list with one sub-data frame for each Name (produced by the split), so it needs to be put back together with another call to do.call(rbind, ...).
Final tidy-up: the whole process messes up the data frame's row names. Setting them to NULL is a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.
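Applied to the sample data frame from the question, the result should look like this (shown for illustration; it matches the dplyr answer below):
res
#    Name Value
# 1 Name1     X
# 2 Name1     Y
# 3 Name1     Y
# 4 Name1     X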
We can use dplyr. lag shifts a column down by one row, so Value != lag(Value) tests whether a value differs from the previous one, and which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) also collects the row numbers of the preceding rows. Finally, slice subsets the data frame by those row numbers.
library(dplyr)
df2 <- df %>%
  slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
If the code is too long to read, you can also calculate the indices before using the slice function, as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
Using base R, I would probably use diff on the grouping column to find the rows where the value changes:
df <- data.frame(colA=c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
colB=1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
colA colB
3 1 3
4 2 4
7 1 7
10 3 10
5 1 5
8 3 8
11 1 11
There is probably a faster option though. (Wrap the index vector in sort() if you want the selected rows in their original order.)
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1","Name1" ),
Value = c("X", "X", "Y", "Y", "X"))
# flag rows where Value differs from the previous row
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# keep each flagged row together with the row that precedes it
# (shift with type = "lead" looks ahead to the next row's flag)
# and deselect the helper variable idx
dt[idx | shift(idx, 1, type = "lead", fill = FALSE)][, .(Name, Value)]
#>     Name Value
#> 1: Name1     X
#> 2: Name1     Y
#> 3: Name1     Y
#> 4: Name1     X
Why can this give an odd number of rows rather than an even one? When two change points are adjacent (e.g. X, Y, X), the selected before/after pairs overlap and share a row, so fewer than two rows are added per change.
I have a table, called table_wo_nas, with multiple columns, one of which is titled ID. For each value of ID there are many rows. I want to write a function that, for input x, will output a data frame containing the number of rows for each ID, with column headers ID and nobs respectively, as below for x <- c(2,4,8).
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
This is what I have. It works when x is a single value (e.g. 3), but not when it contains multiple values, for example 1:10 or c(2,5,7). I receive the warning "In ID[counter] <- x : number of items to replace is not a multiple of replacement length". I've just started learning R and have been struggling with this for a week; I have searched manuals, this site, Google, everything. Can someone help please?
counter <- 1
ID <- vector("numeric") ## contain x
nobs <- vector("numeric") ## contain nrow
for (i in x) {
  r <- subset(table_wo_nas, ID %in% x)  ## create subset for rows of ID=x
  ID[counter] <- x                      ## add x to ID
  nobs[counter] <- nrow(r)              ## add nrow to nobs
  counter <- counter + 1
}                                       ## loop
result <- data.frame(ID, nobs) ## create data frame
In base R,
# To make a named vector, either:
tmp <- sapply(split(table_wo_nas, table_wo_nas$ID), nrow)
# OR just:
tmp <- table(table_wo_nas$ID)
# AND
# arrange into a data.frame; as.vector drops the table class so the
# counts become a single column rather than two
nobs_df <- data.frame(ID = names(tmp), nobs = as.vector(tmp))
Alternately, coerce the table into a data.frame directly, and rename:
nobs_df <- data.frame(table(table_wo_nas$ID))
names(nobs_df) <- c('ID', 'nobs')
If you only want certain IDs, subset by value rather than by row position:
nobs_df[nobs_df$ID %in% c(2, 4, 8), ]
There are many, many more options; these are just a few.
With dplyr,
library(dplyr)
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n())
If you only want certain IDs, add on a filter:
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n()) %>% filter(ID %in% c(2, 4, 8))
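In recent dplyr versions, count() collapses the group_by/summarise pair; the name argument (available from roughly dplyr 0.8 onward) sets the output column:
table_wo_nas %>% count(ID, name = "nobs")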
Seems pretty straightforward if you just use table again:
tbl <- table(table_wo_nas[, 'ID'])
data.frame(IDs = names(tbl), nobs = as.vector(tbl))
Could also get a quick answer, although with different column names, using:
as.data.frame(table(table_wo_nas[, 'ID']))
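If you want the matching column names in one step, setNames can wrap that call:
setNames(as.data.frame(table(table_wo_nas[, 'ID'])), c('ID', 'nobs'))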
Try this. (Cleaned up so it runs as-is; it assumes your data frame table_wo_nas has an ID column, as in the question.)
x <- c(2, 4, 8)
count_of <- function(x, df = table_wo_nas) {
  count_of_id <- numeric(length(x))
  for (i in seq_along(x)) {
    # find the number of rows for each value of x
    count_of_id[i] <- length(which(df$ID == x[i]))
  }
  data.frame(ID = x, nobs = count_of_id)
}
count_of(x)
I have a vector of values r as follows:
r<-c(1,3,4,6,7)
and a data frame df with two columns, id and freq:
id<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,1,4,15,16,17,18,19,20)
freq<-c(1,3,2,4,5,6,6,7,8,3,3,1,6,9,9,1,1,4,3,7,7)
df<-data.frame(id,freq)
Using the r vector, I need to extract a sample of records (in the form of a new data frame) from df such that the freq values of the sampled records equal the values in my r vector. Needless to say, if it finds multiple records with the same freq value, it should randomly pick one of them. For instance, one possible outcome would be:
id frequency
12 1
10 3
4 4
7 6
8 7
I would be thankful if anyone could help me with this.
You could try data.table:
library(data.table)
setDT(df)[freq %in% r, sample(id, 1L), freq]
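The sampled column is named V1 by default; wrapping the expression in .() lets you name it:
setDT(df)[freq %in% r, .(id = sample(id, 1L)), by = freq]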
Or using base R
aggregate(id ~ freq, df, subset = freq %in% r, FUN = sample, 1L)
Update
If you have a vector 'r' with duplicate values and want the sample from 'df' to contain as many records per freq value as there are occurrences of that value in 'r':
r <-c(1,3,3,4,6,7)
res <- do.call(rbind, lapply(split(r, r), function(x) {
  x1 <- df[df$freq %in% x, ]
  x1[sample(1:nrow(x1), length(x), replace = FALSE), ]
}))
row.names(res) <- NULL
You can use filter and sample_n from "dplyr":
library(dplyr)
set.seed(1)
df %>%
filter(freq %in% r) %>%
group_by(freq) %>%
sample_n(1)
# Source: local data frame [5 x 2]
# Groups: freq
#
# id freq
# 1 12 1
# 2 10 3
# 3 17 4
# 4 13 6
# 5 8 7
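In current dplyr versions, sample_n() is superseded by slice_sample(); the equivalent call would be:
df %>%
  filter(freq %in% r) %>%
  group_by(freq) %>%
  slice_sample(n = 1)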
Have you tried using the match() function or %in%? This might not be a fast/clean solution, but it uses only base R functions:
rUnique <- unique(r)
df2 <- df[df$freq %in% rUnique, ]
x <- data.frame(id = NA, freq = rUnique)
for (i in seq_along(rUnique)) {
  # caveat: if only one id matched, sample(n, 1) would draw from 1:n
  x[i, 1] <- sample(df2[df2[, 2] == rUnique[i], 1], 1)
}
print(x)