Fast creation of data.frame - r

Is there a way to create a data.frame faster or smarter than the one I made below?
df <- data.frame(ID = rep(c("WT", "KO"), each = 4),
                 Time = rep(c("A", "B", "C", "D"), times = 2),
                 replicate(5, sample(0:100, 8, replace = TRUE)))
colnames(df) <- c("ID", "Time", "a", "b", "c", "d", "e")
The data.frame should still look like this
df
ID Time a b c d e
WT A 28 56 50 60 15
WT B 54 77 11 67 34
WT C 53 8 87 62 55
WT D 30 73 47 82 1
KO A 24 83 14 17 36
KO B 91 83 72 41 4
KO C 79 17 76 21 54
KO D 41 40 77 49 92
Thanks

You can use expand.grid for the unique combinations of the non-numeric columns (sometimes you can even make use of built-in constants such as LETTERS) and run sample only once, wrapping the result in a matrix, something like
set.seed(123)
data.frame(expand.grid(c("WT", "KO"), LETTERS[1:4]),
matrix(sample(40), ncol = 5))
# Var1 Var2 X1 X2 X3 X4 X5
# 1 WT A 12 36 6 11 24
# 2 KO A 31 15 1 27 13
# 3 WT B 16 29 8 22 25
# 4 KO B 33 14 21 28 26
# 5 WT C 34 19 32 4 20
# 6 KO C 2 38 37 35 7
# 7 WT D 18 3 40 10 5
# 8 KO D 30 23 17 9 39
For less specific cases, I would recommend looking into @TylerRinker's wakefield package, which allows you to generate random data sets easily.
For general information: with data.table v1.9.5+ you can set new column names by reference using setnames. For example, if your new data set is called res, you could simply do
library(data.table) # v1.9.5+
setnames(res, c("ID", "Time", letters[1:5]))
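Note that expand.grid varies its first argument fastest, so the call above yields WT A, KO A, WT B, ... rather than the WT A, WT B, ... order in the question. A minimal base R sketch that reproduces the requested layout, assuming the row order matters to you (keys and vals are just illustrative names, and set.seed is only there for reproducibility):

```r
set.seed(123)  # only to make the sketch reproducible

# expand.grid varies its first argument fastest, so pass Time first
# and then swap the columns to get WT A, WT B, ..., KO D
keys <- expand.grid(Time = LETTERS[1:4], ID = c("WT", "KO"),
                    stringsAsFactors = FALSE)[, c("ID", "Time")]

# a single sample() call, reshaped into an 8 x 5 matrix with named columns
vals <- matrix(sample(0:100, 8 * 5, replace = TRUE), ncol = 5,
               dimnames = list(NULL, letters[1:5]))

res <- data.frame(keys, vals, stringsAsFactors = FALSE)
names(res)
```

Because the matrix already carries column names, no renaming step is needed afterwards.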

Related

How to apply the function to each row?

I want to generate 4 new columns from an existing variable total by random sampling. The results for each row should meet the condition s1 + s2 + s3 + s4 == total. For example,
> tabulate(sample.int(4, 100, replace = TRUE))
[1] 22 21 27 30
The following code does not work since the function appears to recycle the first row and applies it column-wise.
DT <- data.table(total = c(100, 110, 90, 92))
DT[, c(paste0("s", 1:4)) := tabulate(sample.int(4, total, replace = TRUE))]
> DT
total s1 s2 s3 s4
1: 100 31 31 31 31
2: 110 25 25 25 25
3: 90 22 22 22 22
4: 92 22 22 22 22
How can I get around this? I am clearly missing some basic understanding of how R vectors and lists work. Your help will be much appreciated.
Edited following edited question:
data.table expects a list on the right-hand side when you assign to several columns at once. To get a separate draw for each row, group by row number:
DT <- data.table(total = c(100, 110, 90, 102, 92))
DT[, paste0("s", 1:4) := {
  # nbins = 4 guarantees four counts even if one value is never drawn
  as.list(tabulate(sample.int(4, total, replace = TRUE), nbins = 4))
}, by = seq_len(nrow(DT))]
Which outputs the following, satisfying the OP criteria:
> DT
total s1 s2 s3 s4
1: 100 27 28 28 17
2: 110 25 23 36 26
3: 90 26 19 26 19
4: 102 28 24 21 29
5: 92 17 27 22 26
> apply(DT[, 2:5],1, sum)
[1] 100 110 90 102 92
Maybe you can try the code below
DTout <- cbind(
DT,
do.call(
rbind,
lapply(DT$total, function(x) diff(sort(c(0, sample(x - 1, 3), x))))
)
)
which gives
total V1 V2 V3 V4
1: 100 51 5 17 27
2: 110 41 1 40 28
3: 90 32 34 14 10
4: 102 5 73 13 11
5: 92 17 13 17 45
Test
> rowSums(DTout[,-1])
[1] 100 110 90 102 92
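As a side note, tabulate(sample.int(4, n, replace = TRUE)) is a multinomial draw with equal probabilities, so base R's rmultinom can produce the same kind of split directly. A small sketch under that assumption (totals and parts are illustrative names, not from the question):

```r
totals <- c(100, 110, 90, 102, 92)
set.seed(1)  # only for reproducibility

# one multinomial draw per total; sapply returns a 4 x 5 matrix,
# so transpose to get one row per total
parts <- t(sapply(totals, function(n) rmultinom(1, size = n, prob = rep(1, 4))[, 1]))
colnames(parts) <- paste0("s", 1:4)

rowSums(parts)  # each row adds up to its total
```

The result can then be attached with cbind, as in the answer above. Unlike the sort/diff approach, this one can produce zeros in a column, which matches the behaviour of the original tabulate call.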

Sorting one variable in a data frame by id

I have a data frame with a lot of company information separated by an id variable. I want to sort one of the variables within every id. Let's take this example,
df <- structure(list(id = c(110, 110, 110, 90, 90, 90, 90, 252, 252
), var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39), var2 = c(234,
12, 43, 32, 21, 19, 16, 34, 44)), .Names = c("id", "var1", "var2"
), row.names = c(NA, -9L), class = "data.frame")
Which looks like this
df
id var1 var2
1 110 26 234
2 110 21 12
3 110 54 43
4 90 10 32
5 90 18 21
6 90 9 19
7 90 16 16
8 252 54 34
9 252 39 44
Now I want to sort the data frame by var1 within each id. The easiest solution I can think of is to use apply like this,
> apply(df, 2, sort)
id var1 var2
[1,] 90 9 12
[2,] 90 10 16
[3,] 90 16 19
[4,] 90 18 21
[5,] 110 21 32
[6,] 110 26 34
[7,] 110 39 43
[8,] 252 54 44
[9,] 252 54 234
However, this is not the output I am seeking. The correct output should be,
id var1 var2
1 110 21 12
2 110 26 234
3 110 54 43
4 90 9 19
5 90 10 32
6 90 16 16
7 90 18 21
8 252 39 44
9 252 54 34
Group by id and sort by var1 column and keep original id column order.
Any idea how to sort like this?
Note: as mentioned by Moody_Mudskipper, there is no need for the tidyverse; this can also be done easily in base R:
df[order(ordered(df$id, unique(df$id)), df$var1), ]
A one-liner tidyverse solution w/o any temp vars:
library(tidyverse)
df %>% arrange(ordered(id, unique(id)), var1)
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
Explanation of why apply(df, 2, sort) does not work
What you were trying to do is to sort each column independently. apply runs over the specified dimension (2 in this case which corresponds to columns) and applies the function (sort in this case).
apply tries to further simplify the results, in this case to a matrix. So you are getting back a matrix (not a data.frame) where each column is sorted independently. For example this row from the apply call:
# [1,] 90 9 12
does not even exist in the original data.frame.
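To see this on a smaller case, here is a sketch with a three-row toy data frame (toy, sorted_cols and row_exists are made-up names for illustration):

```r
toy <- data.frame(id = c(110, 110, 90), var1 = c(26, 21, 10), var2 = c(234, 12, 32))

# every column is sorted on its own, so values from different rows get mixed
sorted_cols <- apply(toy, 2, sort)
is.matrix(sorted_cols)  # TRUE: the result is a matrix, not a data.frame

# the first row of the result is (90, 10, 12); check whether toy has such a row
first_row <- sorted_cols[1, ]
row_exists <- any(apply(toy, 1, function(r) all(r == first_row)))
row_exists  # FALSE

# ordering whole rows keeps each row intact
toy[order(match(toy$id, unique(toy$id)), toy$var1), ]
```

The last line reorders row indices instead of column values, which is why the id, var1 and var2 of each observation stay together.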
Another base R option using order and match
df[with(df, order(match(id, unique(id)), var1, var2)), ]
# id var1 var2
#2 110 21 12
#1 110 26 234
#3 110 54 43
#6 90 9 19
#4 90 10 32
#7 90 16 16
#5 90 18 21
#9 252 39 44
#8 252 54 34
We can convert the id to factor in order to split while preserving the original order. We can then loop over the list and order, and rbind again, i.e.
df$id <- factor(df$id, levels = unique(df$id))
do.call(rbind, lapply(split(df, df$id), function(i)i[order(i$var1),]))
# id var1 var2
#110.2 110 21 12
#110.1 110 26 234
#110.3 110 54 43
#90.6 90 9 19
#90.4 90 10 32
#90.7 90 16 16
#90.5 90 18 21
#252.9 252 39 44
#252.8 252 54 34
NOTE: You can reset the rownames by rownames(new_df) <- NULL
In base R we could use split<- :
split(df,df$id) <- lapply(split(df,df$id), function(x) x[order(x$var1),] )
or as @Markus suggests :
split(df, df$id) <- by(df, df$id, function(x) x[order(x$var1),])
output in either case :
df
# id var1 var2
# 1 110 21 12
# 2 110 26 234
# 3 110 54 43
# 4 90 9 19
# 5 90 10 32
# 6 90 16 16
# 7 90 18 21
# 8 252 39 44
# 9 252 54 34
With the following tidyverse pipe, the question's output is reproduced.
library(tidyverse)
df %>%
mutate(tmp = cumsum(c(0, diff(id) != 0))) %>%
group_by(id) %>%
arrange(tmp, var1) %>%
select(-tmp)
## A tibble: 9 x 3
## Groups: id [3]
# id var1 var2
# <dbl> <dbl> <dbl>
#1 110 21 12
#2 110 26 234
#3 110 54 43
#4 90 9 19
#5 90 10 32
#6 90 16 16
#7 90 18 21
#8 252 39 44
#9 252 54 34

Put the first row as the column names of my dataframe with dplyr in R

This is my dataframe:
x<-data.frame(A = c(letters[1:10]), M1 = c(11:20), M2 = c(31:40), M3 = c(41:50))
colnames(x)<-NULL
I want to transpose (t(x)) and use the first column of x as the colnames of the new dataframe t(x).
Also I need them (the colnames of t(x)) to be identified as words/letters (as character right?)
Is it possible to do this with dplyr package?
Any help?
The {janitor} package is good for this and is flexible enough to be able to select any row to push to column names:
library(tidyverse)
library(janitor)
x <- x %>% row_to_names(row_number = 1)
You can do this easily in base R. Just make the first column of x be the row names, then remove the first column and transpose.
row.names(x) = x[,1]
x = t(x[,-1])
x
a b c d e f g h i j
M1 11 12 13 14 15 16 17 18 19 20
M2 31 32 33 34 35 36 37 38 39 40
M3 41 42 43 44 45 46 47 48 49 50
Try this:
library(dplyr)
library(tidyr)
x <- data.frame(
A = c(letters[1:10]),
M1 = c(11:20),
M2 = c(31:40),
M3 = c(41:50))
x %>%
gather(key = key, value = value, 2:ncol(x)) %>%
spread(key = names(x)[1], value = "value")
key a b c d e f g h i j
1 M1 11 12 13 14 15 16 17 18 19 20
2 M2 31 32 33 34 35 36 37 38 39 40
3 M3 41 42 43 44 45 46 47 48 49 50
I think column_to_rownames from the tibble package would be your simplest solution. Use it before you transpose with t.
library(magrittr)
library(tibble)
x %>%
column_to_rownames("A") %>%
t
#> a b c d e f g h i j
#> M1 11 12 13 14 15 16 17 18 19 20
#> M2 31 32 33 34 35 36 37 38 39 40
#> M3 41 42 43 44 45 46 47 48 49 50
The "M1", "M2", "M3" above are row names. If you want to keep them inside (as a column), you can add rownames_to_column from the same package.
x %>%
column_to_rownames("A") %>%
t %>%
as.data.frame %>%
rownames_to_column("key")
#> key a b c d e f g h i j
#> 1 M1 11 12 13 14 15 16 17 18 19 20
#> 2 M2 31 32 33 34 35 36 37 38 39 40
#> 3 M3 41 42 43 44 45 46 47 48 49 50
Essentially,
column_to_rownames("A") converts column "A" in x to row names,
t transposes the data.frame (now a matrix),
as.data.frame reclassifies it back as a data.frame (which is necessary for the next function), and
rownames_to_column("key") converts the row names into a new column called "key".
Using rownames_to_column() from the tibble package
library(magrittr)
library(tibble)
x %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
rownames_to_column() %>%
`colnames<-`(.[1,]) %>%
.[-1,] %>%
`rownames<-`(NULL)
#> A a b c d e f g h i j
#> 1 M1 11 12 13 14 15 16 17 18 19 20
#> 2 M2 31 32 33 34 35 36 37 38 39 40
#> 3 M3 41 42 43 44 45 46 47 48 49 50
x %>%
`row.names<-`(.[, 1]) %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
.[-1,]
#> a b c d e f g h i j
#> M1 11 12 13 14 15 16 17 18 19 20
#> M2 31 32 33 34 35 36 37 38 39 40
#> M3 41 42 43 44 45 46 47 48 49 50
Created on 2018-10-06 by the reprex package (v0.2.1.9000)

Filter data frame by results from tapply function

I'm trying to apply a tapply function I wrote to filter a dataset. Here is a sample data frame (df) below to describe what I'm trying to do.
I want to keep in my data frame the rows where the value of df$Cumulative_Time is closest to the value of 14. It should do this for each factor level in df$ID (keep the row closest to the value 14 for each ID factor).
ID Date Results TimeDiff Cumulative_Time
A 7/10/2015 71 0 0
A 8/1/2015 45 20 20
A 8/22/2015 0 18 38
A 9/12/2015 79 17 55
A 10/13/2015 44 26 81
A 11/27/2015 98 37 118
B 7/3/2015 75 0 0
B 7/24/2015 63 18 18
B 8/21/2015 98 24 42
B 9/26/2015 70 30 72
C 8/15/2015 77 0 0
C 9/2/2015 69 15 15
C 9/4/2015 49 2 17
C 9/8/2015 88 2 19
C 9/12/2015 41 4 23
C 9/19/2015 35 6 29
C 10/10/2015 33 18 47
C 10/14/2015 31 3 50
D 7/2/2015 83 0 0
D 7/28/2015 82 22 22
D 8/27/2015 100 26 48
D 9/17/2015 19 17 65
D 10/8/2015 30 18 83
D 12/9/2015 96 51 134
D 1/6/2016 30 20 154
D 2/17/2016 32 36 190
D 3/19/2016 42 27 217
I got as far as the following:
spec_day = 14 # value I want to compare df$Cumulative_Time to
# applying function to calculate closest value to spec_day
tapply(df$Cumulative_Time, df$ID, function(x) which(abs(x - spec_day) == min(abs(x - spec_day))))
Question: how do I include this tapply function as a means to do the filtering of my data frame df? Am I approaching this problem the right way, or is there some simpler way to accomplish this that I'm not seeing? Any help would be appreciated--thanks!
Here's a way you can do it, note that I didn't use tapply:
spec_day <- 14
new_df <- do.call('rbind',
by(df, df$ID,
FUN = function(x) x[which.min(abs(x$Cumulative_Time - spec_day)), ]
))
new_df
ID Date Results TimeDiff Cumulative_Time
A A 8/1/2015 45 20 20
B B 7/24/2015 63 18 18
C C 9/2/2015 69 15 15
D D 7/28/2015 82 22 22
which.min (and its sibling which.max) is a very useful function.
Here's a more concise and faster alternative using data.table:
library(data.table)
setDT(df)[, .SD[which.min(abs(Cumulative_Time - 14))], by = ID]
# ID Date Results TimeDiff Cumulative_Time
#1: A 8/1/2015 45 20 20
#2: B 7/24/2015 63 18 18
#3: C 9/2/2015 69 15 15
#4: D 7/28/2015 82 22 22
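If you would rather stay close to the original tapply idea, a base R sketch using ave builds a logical filter directly, since ave returns the per-group minimum aligned with the rows of the data frame. The data below is a cut-down version of the question's table, and dist and keep are illustrative names:

```r
df <- data.frame(ID = c("A", "A", "A", "B", "B"),
                 Cumulative_Time = c(0, 20, 38, 0, 18))
spec_day <- 14

# distance of every row from the target value
dist <- abs(df$Cumulative_Time - spec_day)

# ave() repeats each group's minimum distance for every row of that group
keep <- dist == ave(dist, df$ID, FUN = min)

df[keep, ]  # row 2 for A (time 20) and row 5 for B (time 18)
```

Unlike which.min, this keeps all rows that tie for the closest value within an ID, which may or may not be what you want.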

How to set multiple columns and selected rows in data table to value from other data table

This question is somewhat related to this one (How to set multiple columns in a data table to values from different columns in the same data table?).
set.seed(1)
df <- data.frame(matrix(sample(1:100,30),ncol=6))
# X1 X2 X3 X4 X5 X6
#1 27 86 19 43 75 29
#2 37 97 16 88 17 1
#3 57 62 61 83 51 28
#4 89 58 34 32 10 81
#5 20 6 67 63 21 25
library(data.table)
dt <- data.table(df)
df1 <- data.frame(matrix(sample(1:100,30),ncol=6))
df1
# X1 X2 X3 X4 X5 X6
#1 49 64 74 68 39 8
#2 60 75 58 2 88 24
#3 100 11 69 40 35 91
#4 19 67 98 61 97 48
#5 80 38 46 57 6 29
dt1 <- data.table(df1)
This time, I want to change certain rows and columns.
dt[1:3, c("X1","X2"), with = F] = dt1[1:3, c("X3","X5"), with = F]
But this gives an error:
Error in `[<-.data.table`(`*tmp*`, 1:3, c("X1", "X2"), with = F, value = list( :
unused argument (with = F)
My real data has many columns, so I would like to specify the column names as character vectors.
By using the = operator as you do, you are trying to assign the values to the desired spots in the data.table. Instead you should update your data.table dt by reference with the := operator inside dt[...]. A slight adaptation of @thelatemail's comment (the second with = FALSE is not needed):
dt[1:3, c("X1","X2") := dt1[1:3, c("X3","X5"), with = FALSE]]
