Altering a data frame in R

I have a data frame that has the first column go from 1 to 365 like this
c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2...
and the second column has times that repeat over and over again like this
c(0,30,130,200,230,300,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
so for every 1 value in the first column I have a corresponding time in the second column; then, when I get to the 2's, the times start over and each 2 has a corresponding time.
Occasionally I will come across
c(3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4...
c(0,30,130,200,230,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
Here one of the 3's is missing and the corresponding time of 300 is missing with it.
How can I go through my entire data frame and add these missing values? I need a way for R to identify any missing value, insert a row, and fill in the appropriate value (1 to 365) in column one and the appropriate time in column two. For the given example, R would add a row between 230 and 330, placing a 3 in the first column and 300 in the second. Some parts of the column are missing several consecutive values; it is not just one here and there.

EDIT: solution with all 10 times clearly specified in advance, plus code tidy-up and commenting.
You need to create another data.frame containing every possible row and then merge it with your data.frame. The key aspect is the all.x = TRUE in the final merge, which forces the gaps in your data to show up as NA rows. I simulated the gaps by sampling only 15 of the first 20 possible day/time combinations into your.dat.
# create vectors for the days and times
the.days = 1:365
the.times = c(0,30,100,130,200,230,330,400,430,500) # the 10 times to repeat
# create a master data.frame with all the times repeated for each day, taking only the first 20 observations
dat.all = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365))[1:20,]
# mimic your data.frame with some gaps in it (only 15 of 20 observations are present)
your.sample = sample(1:20, 15)
your.dat = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365), x3 = rnorm(365*10))[your.sample,]
# left outer join merge to include ALL of the master set and all of your matching subset, filling blanks with NA
merge(dat.all, your.dat, all.x = TRUE)
Here is the output from the merge, showing all 20 possible records with the gaps clearly visible as NA:
x1 x2 x3
1 1 0 NA
2 1 30 1.23128294
3 1 100 0.95806838
4 1 130 2.27075361
5 1 200 0.45347199
6 1 230 -1.61945983
7 1 330 NA
8 1 400 -0.98702883
9 1 430 NA
10 1 500 0.09342522
11 2 0 0.44340164
12 2 30 0.61114408
13 2 100 0.94592127
14 2 130 0.48916825
15 2 200 0.48850478
16 2 230 NA
17 2 330 0.52789171
18 2 400 -0.16939587
19 2 430 0.20961745
20 2 500 NA
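As an aside (not part of the original answer), tidyr's complete() builds the same full grid and performs the join in one step; a minimal sketch reusing the the.days/the.times vectors and x1/x2 column names from above:
library(tidyr)
# expand x1 and x2 to every day/time combination; unmatched rows get NA in x3
filled <- complete(your.dat, x1 = the.days, x2 = the.times)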

Here are a few NA-handling functions to help you get started.
For the inserting task, you should provide your own data using dput or a reproducible example.
df <- data.frame(x = sample(c(1, 2, 3, 4), 100, replace = TRUE),
                 y = sample(c(0, 30, 130, 200, 230, 300, 330, 400, 430, 500), 100, replace = TRUE))
# introduce NAs into both columns for demonstration
df[1:20, 1] <- NA
df$y <- ifelse(df$y == 0, NA, df$y)
# Columns x and y have NA's in different places.
# Logical test for NA
is.na(df)
# Keep non-NA cases of one column
df[!is.na(df$x),]
df[!is.na(df$y),]
# Keep rows that are complete across both columns
df[complete.cases(df),]
# Gives the cases that are incomplete.
df[!complete.cases(df),]
# Returns the cases without NAs
na.omit(df)

Transform table [duplicate]

This question already has answers here: Repeat each row of data.frame the number of times specified in a column (10 answers).
I would like to repeat entire rows in a data frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Any idea how to do this cleanly?
We can use expandRows to expand the rows based on the value in the 'samples' column, then convert to data.table and, grouped by 'chr', paste the columns together along with the row sequence using sprintf to update the 'samples' column.
library(splitstackshape)
setDT(expandRows(df, "samples"))[,
samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
NOTE: data.table will be loaded when we load splitstackshape.
You can achieve this using base R (i.e. avoiding data.tables), with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
duplicate_rows <- function(chr, starts, ends, samples) {
expanded_samples <- paste0(chr, "-", starts, "-", ends, "-", "s", 1:samples)
repeated_rows <- data.frame("chr" = chr, "starts" = starts, "ends" = ends, "samples" = expanded_samples)
repeated_rows
}
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)
The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
The above code can be made cleaner by using Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function), but this may be overkill for what you're after.
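For instance, a minimal purrr sketch of the same stacking step, assuming the duplicate_rows() function defined above (note that by_row has since moved to the purrrlyr package):
library(purrr)
# pmap_dfr() applies duplicate_rows to each row's fields and row-binds the results
new_df <- pmap_dfr(list(df$chr, df$start, df$end, df$samples), duplicate_rows)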
Example using the DataFrame function from the S4Vectors package:
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column represents the number of times to repeat its corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5
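For completeness, the same row expansion works in base R with no packages at all, by subsetting with a repeated row-index vector; a small sketch on an ordinary data.frame:
df <- data.frame(x = c('a', 'b', 'c', 'd', 'e'), y = 1:5)
# repeat each row's index y times, then subset those rows
df[rep(seq_len(nrow(df)), df$y), ]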

R code for repeating value into column

I am basically new to using R.
I have a list of repeating codes (numeric/categorical) from an Excel file. I need to add another column of values (even random ones) such that every occurrence of the same code gets the same value.
Codes Value
1 122
1 122
2 155
2 155
2 155
4 101
4 101
5 251
5 251
Thank you.
We can use match:
n <- length(code0 <- unique(code))
value <- sample(4 * n, n)[match(code, code0)]
or factor:
n <- length(unique(code))
value <- sample(4 * n, n)[factor(code)]
The random integers generated are between 1 and 4 * n. The number 4 is arbitrary; you can also put 100.
Example
set.seed(0); code <- rep(1:5, sample(5))
code
# [1] 1 1 1 1 1 2 2 3 3 3 3 4 4 4 5
n <- length(code0 <- unique(code))
sample(4 * n, n)[match(code, code0)]
# [1] 5 5 5 5 5 18 18 19 19 19 19 12 12 12 11
Comment
The above gives the most general treatment, assuming that code is not readily sorted or taking consecutive values.
If code is sorted (no matter what value it takes), we can also use rle:
if (!is.unsorted(code)) {
n <- length(k <- rle(code)$lengths)
value <- rep.int(sample(4 * n, n), k)
}
If code takes consecutive values 1, 2, ..., n (but not necessarily sorted), we can skip match or factor and do:
n <- max(code)
value <- sample(4 * n, n)[code]
A further note: if code is not numerical but categorical, the match and factor methods will still work.
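A quick illustration of that point (my own example, not from the answer), applying the match method to a character vector:
code <- c("a", "a", "b", "c", "c", "b")
n <- length(code0 <- unique(code))
# every occurrence of the same code receives the same random value
value <- sample(4 * n, n)[match(code, code0)]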
What you could also do is the following; it is perhaps more intuitive for a beginner:
data <- data.frame('a' = c(122,122,155,155,155,101,101,251,251))
duplicates <- unique(data)
duplicates[, 'b'] <- rnorm(nrow(duplicates))
data <- merge(data, duplicates, by='a')
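The same join-back idea can also be written with dplyr, if you prefer that idiom; a minimal sketch using the data defined above:
library(dplyr)
# draw one random value per unique code, then join it back onto every row
data %>%
  distinct(a) %>%
  mutate(b = rnorm(n())) %>%
  right_join(data, by = 'a')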

R Loop with conditions

I have a series of repeated IDs that I would like to assign to groups of fixed size. The subject IDs repeat with different frequencies, for example:
# Example Data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
> head(Data)
ID Repeats
1 101 2
2 102 3
3 103 1
4 104 3
I would like the same repeated ID to stay within the same group. However, each group has a fixed capacity (say 3 only). For example, in my desired output matrix each group can only accommodate 3 IDs:
# Create empty data frame for group annotation
# Add 3 rows in order to have more space for IDs
# Some groups will have NAs due to keeping IDs together (I'm OK with that)
Target = data.frame(matrix(NA,nrow=(sum(Data$Repeats)+3),
ncol=dim(Data)[2]))
names(Target)<-c("ID","Group")
Target$Group<-rep(1:3)
Target$Group<-sort(Target$Group)
> head(Target)
ID Group
1 NA 1
2 NA 1
3 NA 1
4 NA 1
5 NA 2
6 NA 2
I can loop each ID into my Target data frame, but this does not guarantee that repeated IDs will stay together in the same group:
# Expand the repeated IDs into a single vector
IDs.repeat = rep(Data$ID, times=Data$Repeats)
# loop IDs to Targets to assign IDs to groups
for (i in 1:length(IDs.repeat))
{
Target$ID[i]<-IDs.repeat[i]
}
In the loop above I get the same ID (102) across two different groups (1 and 2), which I would like to avoid:
> head(Target)
ID Group
1 101 1
2 101 1
3 102 1
4 102 1
5 102 2
6 103 2
Instead, I want the code to insert NA when there is not enough space left in a group for the next ID:
> head(Target)
ID Group
1 101 1
2 101 1
3 NA 1
4 NA 1
5 102 2
6 102 2
Does anyone have a solution that keeps all copies of an ID in the same group whenever there is sufficient space left after assigning ID i?
I think I need to count the NAs within each group and check whether that count is >= the number of repeats of the unique ID, but I don't know how to implement this. Maybe by nesting another loop over the groups j?
Any help with the loop will be appreciated immensely!
Here's one solution,
## This is the data.frame I'll try to match
target <- data.frame(
ID = c(
rep(101, 2),
rep(102, 3),
rep(103, 1),
rep(104, 3)),
Group = c(
rep(1L, 6), # the "L" suffix, as in 1L, makes it an integer rather than numeric
rep(2L, 3)
)
)
print(target)
## Your example data
ID = c(101,102,103,104)
Repeats = c(2,3,1,3)
Data = data.frame(ID,Repeats)
head(Data)
ids_to_group <- 3 # the number of ids per group is specified here.
Data$Group <- sort(
rep(1:ceiling(length(Data$ID) / ids_to_group),
ids_to_group))[1:length(Data$ID)]
# The do.call(rbind, lapply(x = a series, FUN = function(x) { }))
# pattern is a really useful way to stack data.frames.
# lapply is basically a fancy for-loop; check it out by sending
# ?lapply to the console (to view the help page).
output <- do.call(
rbind,
lapply(unique(Data$ID), FUN = function(ids) {
print(paste(ids, "done.")) # I like to put in print statements to follow along
obs <- Data[Data$ID == ids, ]
data.frame(ID = rep(obs$ID, obs$Repeats))
})
)
output <- merge(output, Data[,c("ID", "Group")], by = "ID")
identical(target, output) # returns true if they're equivalent
# For example inspect each with:
str(target)
str(output)
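For reference, here is a minimal greedy sketch (my own, not part of the answer above) that produces the NA-padded layout the question asks for. It assumes a capacity of 4 rows per group, which is what the desired head(Target) implies, and that no ID repeats more than capacity times:
capacity <- 4
ids <- rep(Data$ID, times = Data$Repeats)
groups <- list()
current <- numeric(0)
for (id in unique(ids)) {
  block <- ids[ids == id]
  if (length(current) + length(block) > capacity) {
    # not enough room left: pad the unfinished group with NA and start a new one
    groups[[length(groups) + 1]] <- c(current, rep(NA, capacity - length(current)))
    current <- numeric(0)
  }
  current <- c(current, block)
}
groups[[length(groups) + 1]] <- c(current, rep(NA, capacity - length(current)))
Target <- data.frame(ID = unlist(groups),
                     Group = rep(seq_along(groups), each = capacity))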

Error counting non-NA entries in dataframe

I am trying to see if the amount of information that I have about a case is correlated with the duration of the user.
Currently, I have a dataframe, df, and I attempted to do the following:
df["amount_known"] <-df[rowSums(!is.na(df)),]
This resulted in the following error:
Error in `[<-.data.frame`(`*tmp*`, "amount_known", value = list(status = c(3L, :
replacement element 1 has 808047 rows, need 808247
What could cause this to happen (and of course, how do I fix it)?
If you want the number of non-NA entries in a new column amount_known in df you can do it like this:
df$amount_known <- rowSums(!is.na(df))
Here's a small example of what is happening:
df <- data.frame(x = 1:3, y = 66:68)
df$y[1] <- NA
df$x[3] <- NA
df
# x y
#1 1 NA
#2 2 67
#3 NA 68
rowSums(!is.na(df))
#[1] 1 2 1
This results in a vector with the number of non-NA entries in each row of df.
Now, if you do
df[rowSums(!is.na(df)),]
This will select the rows in the vector c(1,2,1) from df:
# x y
#1 1 NA
#2 2 67
#1.1 1 NA
So for example, row 1 is shown twice.
And in your code, you were then assigning that output to a new column in df; because a row index of 0 drops rows (here, rows whose non-NA count is 0, i.e. all-NA rows), the replacement had fewer rows than df, which is exactly the mismatch the error reports.
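Back to the stated goal (checking whether the amount of information correlates with duration), a hedged sketch; 'duration' is a hypothetical column name, not taken from the question:
# count non-NA entries per row, excluding the duration column itself
df$amount_known <- rowSums(!is.na(df[, setdiff(names(df), 'duration'), drop = FALSE]))
cor.test(df$amount_known, df$duration)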

Reordering rows in a dataframe by multiple column permutations

I am trying to reorder a data.frame that contains around 250,000 rows and 7 columns. The rows I want at the top of the data.frame are those where column 2 contains the lowest value and column 7 the highest, with the columns running from lowest to highest value in the sequence 2, 5, 1, 4, 6, 3, 7 (so column 5 would contain the second-lowest value, etc.).
Once any rows that match this full sequence are identified, it would move on to find rows where the column values run from lowest to highest in the sequence 2, 5, 1, 4, 6, then 2, 5, 1, 4, and so on, down to rows where column 2 is the lowest and the other column values are randomly assorted. Any row that does not have column 2 as the lowest value would be ignored and left unsorted below the sorted rows. I am struggling to come up with a workable solution to my problem; the best I can do in terms of providing data similar to what I am working with is this:
df<-data.frame(matrix(rnorm(70000), nrow=10000))
df<-abs(df)
If anyone has any ideas, I am all ears.
Thanks!
Given that you have a largish dataset of uniform type (numeric), I would suggest using a matrix rather than a data.frame:
tt <- abs(matrix(rnorm(70000), nrow=10000))
You have a desired order you wish to match against
desiredOrder <- c(2,5,1,4,6,3,7)
You need to find what order each of your rows is in. I think it is easiest here to ensure that you get back a list with an element for each row. I'd suggest something like this:
orders <- lapply(apply(tt, 1, function(x) list(order(x))), unlist)
You will then need to go through (from desiredOrder[seq_len(7)] down to desiredOrder[seq_len(1)]) to test whether the required subset of the order for a particular row equals the required subset of the desired order (I'm thinking some combination of sapply with which and all).
Once you have identified all the rows that match your required result, you can use setdiff to find the unmatched ones, and then reorder tt using this new order vector.
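A rough sketch of that testing step (my own completion of the outline, scoring each row by the longest matched prefix of desiredOrder, on the assumption that this is the ordering you want):
# prefix length: how many leading positions of order(row) match desiredOrder
prefixLen <- sapply(orders, function(o) {
  miss <- which(o != desiredOrder)
  if (length(miss) == 0) length(o) else miss[1] - 1
})
# sort rows by prefix length; order() is stable, so non-matching rows
# (prefix 0) stay below in their original, unsorted order
tt.sorted <- tt[order(prefixLen, decreasing = TRUE), ]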
One possible approach would be to weight rankings of the values in the columns; it would be something like rank regression. 7 columns by 250K rows is not that big. For the columns where you want the low values to have higher weight, you can subtract the rank from NROW(dfrm). If you want to scale the weighting across that column-ordering scheme, then just multiply by a weighting vector, say c(1, .6, .3, 0, .3, .6, 1):
dmat <- matrix(sample(20, 20*7, repl=TRUE), 20, 7)
dfrm <- as.data.frame(dmat)
dfrm$wt <- sapply( dfrm[ c(2,5,1,4,6,3,7)] , rank); dfrm
dfrm$wt[,1:3] <- rep(NROW(dfrm),3) - dfrm$wt[ , 1:3]
dfrm$wt <- dfrm$wt*rep(c(1, .6, .3, 0, .3, .6, 1), each=NROW(dfrm))
dfrm[ order( apply( dfrm$wt, 1, FUN=sum), decreasing=TRUE ) , ]
This does not force the lowest value for V2 to be first, since you implied a multiple criterion. You still have the ability to re-weight if this is not exactly what you imagined.
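For example, if you do want the lowest V2 rows pushed firmly to the top, a hypothetical tweak is to enlarge the first element of the weighting vector (which maps to column 2 in the c(2,5,1,4,6,3,7) ordering), replacing the multiplication step above:
# make column 2's reversed rank dominate the weighted sum
dfrm$wt <- dfrm$wt * rep(c(10, .6, .3, 0, .3, .6, 1), each = NROW(dfrm))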
Like this:
dat <- as.matrix(df)
rnk <- t(apply(dat, 1, rank))
desiredRank <- order(c(2,5,1,4,6,3,7))
rnk.match <- rnk == matrix(desiredRank, nrow(rnk), ncol(rnk), byrow = TRUE)
match.score <- apply(rnk.match, 1, match, x = FALSE) - 1
match.score[is.na(match.score)] <- ncol(dat)
out <- dat[order(match.score, decreasing = TRUE), ]
head(out)
# X1 X2 X3 X4 X5 X6 X7
#[1,] 0.7740246 0.19692680 1.5665696 0.9623104 0.2882492 1.367786 1.8644204
#[2,] 0.5895921 0.00498982 1.7143083 1.2698382 0.1776051 2.494149 1.4216615
#[3,] 0.1981111 0.11379934 1.0379619 0.2130251 0.1660568 1.227547 0.9248101
#[4,] 0.7507257 0.23353923 1.6502192 1.2232615 0.7497352 2.032547 1.4409475
#[5,] 0.5418513 0.06987903 1.8882399 0.6923557 0.3681018 2.172043 1.2215323
#[6,] 0.1731943 0.01088604 0.6878847 0.2450998 0.0125614 1.197478 0.3087192
In this example, the first row matches the whole rank sequence; the next rows match the first five ranks of the sequence:
head(match.score[order(match.score, decreasing = TRUE)])
# [1] 7 5 5 5 5 5
You can use the fact that order() returns the index of the ordering, which is exactly what you are trying to match.
For example, if we apply order to each row of
[1,] 23 17 118 57 20 66 137
[2,] 56 42 52 66 47 8 29
[3,] 35 5 76 35 29 217 89
We would get
[1,] 2 5 1 4 6 3 7
[2,] 6 7 2 5 3 1 4
[3,] 2 5 1 4 3 7 6
Then you simply need to check which rows match what you are looking for.
There are several ways to implement this, below is an example, where we create
a logical matrix, comparisons, which indicates whether each element of a row
is in the "correct" position, as indicated by expectedOrder.
We then order the original rows by how many elements are in the "correct column". (using this phrase loosely, of course)
# assuming mydf is your data frame or matrix
# the expected order of the columns
expectedOrder <- c(2,5,1,4,6,3,7)
# apply the order function to each row.
ordering <- apply(mydf, 1, function(r) order(r) )
# Recall that the output of apply is transposed relative to the input.
# We make use of this along with the recycling of vectors for the comparison
comparisons <- ordering == expectedOrder
# find all rows with at least matches to 2,5,1,4
topRows <- which(colSums(comparisons[1:4, ])==4)
# reorder the indices based on the total number of matches in comparisons,
# i.e. first all 7-matches, then 5-matches, then 4-matches
topRows <- topRows[order(colSums(comparisons[,topRows]), decreasing=TRUE)]
# reorder the dataframe (or matrix)
mydf.ordered <-
rbind(mydf[topRows, ],
mydf[-topRows,])
head(mydf.ordered)
# X1 X2 X3 X4 X5 X6 X7
# 23 17 118 57 20 66 137
# 39 21 102 50 24 53 163
# 80 6 159 116 44 139 248
# 131 5 185 132 128 147 202
# 35 18 75 40 33 67 151
# 61 14 157 82 57 105 355
