R: Transforming an (n*P) * N data frame into n * (N*P)

I'm using R and I have a data frame called df which has (n*P) rows and N columns.
C1 C2 ... CN-1 CN
1-1 100 36 ... 136 76
1-2 120 -33 ... 87 42
1-3 150 14 ... 164 24
:
1-n 20 36 ... 136 76
2-1 109 26 ... 166 87
2-2 -33 87 ... 42 24
2-3 100 36 ... 136 76
:
2-n 100 36 ... 136 76
:
P-1 150 14 ... 164 24
P-2 100 36 ... 765 76
P-3 150 14 ... 164 94
:
P-n 10 26 ... 106 76
And I want to transform this data frame into a data frame with n rows and (N*P) columns. The new data frame, df.new, should look like
C1-1 C2-1 ... CN-1-1 CN-1 C1-2 C2-2 ... CN-1-2 CN-2 ... C1-P C2-P ... CN-1-P CN-P
R1 100 36 ... 136 76 20 36 ... 136 76 ... 150 14 ... 164 24
R2 120 -33 ... 87 42 109 26 ... 166 87 ... 100 36 ... 765 76
:
:
Rn 20 36 ... 136 76 100 36 ... 136 76 ... 10 26 ... 106 76
That is to say, the first N columns of df.new are rbind of rows 1-1, 2-1, 3-1, ... , P-1 of df. The next N columns of df.new are rbind of rows 1-2, 2-2, 3-2, ... , P-2 of df. This continues up to the last N columns of df.new, which will be composed of rows 1-n, 2-n, 3-n, ... , P-n of df. (R1 of df.new is cbind of rows 1-1, 1-2,...,1-n. R2 of df.new is cbind of rows 2-1, 2-2,...,2-n. Rn of df.new is cbind of rows P-1, P-2,...,P-n.)
n, P and N are variables, so their values depend on the case. I tried to create df.new using for loops, but it doesn't work well.
Here is my attempt, which I more or less gave up on.
for (j in 1:n) {
  df.new <- data.frame(matrix(vector(), 1, dim(df)[2],
                              dimnames = list(c(), colnames(df))),
                       stringsAsFactors=F)
  for (i in 1:nrow(df)) {
    if (i %% n == 0) {
      df.new <- rbind(df.new, df[i,])
    } else if (i %% n == j) {
      df.new <- rbind(df.new, df[i,])
    }
  }
  assign(paste0("df.new", j), df.new)
}

library(dplyr)
library(tidyr)
library(tibble)
df %>%
  rownames_to_column("rowname") %>%
  separate(rowname, c("rowname_prefix", "rowname_suffix"), "-") %>%
  gather(col_name, value, -rowname_prefix, -rowname_suffix) %>%
  mutate(col_name = paste(col_name, rowname_prefix, sep="-")) %>%
  select(-rowname_prefix) %>%
  spread(col_name, value) %>%
  mutate(rowname_suffix = paste0("R", rowname_suffix)) %>%
  column_to_rownames("rowname_suffix")
Output is:
C1-1 C1-2 C1-3 C2-1 C2-2 C2-3 C3-1 C3-2 C3-3 C4-1 C4-2 C4-3
R1 100 109 150 36 26 14 136 166 164 76 87 24
R2 120 -33 100 -33 87 36 87 42 765 42 24 76
R3 150 100 150 14 36 14 164 136 164 24 76 94
R4 20 100 10 36 36 26 136 136 106 76 76 76
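For comparison, here is a base R sketch of the same reshaping. It keeps the columns in the block order the question asks for (all columns for prefix 1, then prefix 2, and so on) rather than sorting them by name. It assumes every row name has the form prefix-suffix and that each prefix block lists its suffixes in the same order; the helper names parts and blocks are illustrative only.

```r
# Sample data from this answer: row names are "<prefix>-<suffix>"
df <- data.frame(
  C1 = c(100L, 120L, 150L, 20L, 109L, -33L, 100L, 100L, 150L, 100L, 150L, 10L),
  C2 = c(36L, -33L, 14L, 36L, 26L, 87L, 36L, 36L, 14L, 36L, 14L, 26L),
  C3 = c(136L, 87L, 164L, 136L, 166L, 42L, 136L, 136L, 164L, 765L, 164L, 106L),
  C4 = c(76L, 42L, 24L, 76L, 87L, 24L, 76L, 76L, 24L, 76L, 94L, 76L),
  row.names = paste(rep(1:3, each = 4), rep(1:4, times = 3), sep = "-")
)

# Split each row name into its prefix (block id) and suffix (target row id)
parts  <- do.call(rbind, strsplit(rownames(df), "-", fixed = TRUE))
prefix <- parts[, 1]
suffix <- parts[, 2]

# One data frame per prefix, in order of first appearance
blocks <- split(df, factor(prefix, levels = unique(prefix)))

# Rename the columns of each block to "<column>-<prefix>", then bind side by side
# (assumption: suffixes appear in the same order inside every block)
df.new <- do.call(cbind, lapply(names(blocks), function(p) {
  b <- blocks[[p]]
  names(b) <- paste(names(b), p, sep = "-")
  rownames(b) <- NULL
  b
}))
rownames(df.new) <- paste0("R", unique(suffix))
```

The result has the same values as the tidyr output above; only the column order differs (C1-1, C2-1, C3-1, C4-1, C1-2, ...).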
Sample data:
df <- structure(list(C1 = c(100L, 120L, 150L, 20L, 109L, -33L, 100L,
100L, 150L, 100L, 150L, 10L), C2 = c(36L, -33L, 14L, 36L, 26L,
87L, 36L, 36L, 14L, 36L, 14L, 26L), C3 = c(136L, 87L, 164L, 136L,
166L, 42L, 136L, 136L, 164L, 765L, 164L, 106L), C4 = c(76L, 42L,
24L, 76L, 87L, 24L, 76L, 76L, 24L, 76L, 94L, 76L)), .Names = c("C1",
"C2", "C3", "C4"), class = "data.frame", row.names = c("1-1",
"1-2", "1-3", "1-4", "2-1", "2-2", "2-3", "2-4", "3-1", "3-2",
"3-3", "3-4"))
# C1 C2 C3 C4
#1-1 100 36 136 76
#1-2 120 -33 87 42
#1-3 150 14 164 24
#1-4 20 36 136 76
#2-1 109 26 166 87
#2-2 -33 87 42 24
#2-3 100 36 136 76
#2-4 100 36 136 76
#3-1 150 14 164 24
#3-2 100 36 765 76
#3-3 150 14 164 94
#3-4 10 26 106 76

Related

Creating differences in a new column for certain dates in R

I have a data frame that looks like this:
Date Value1 Value 2 Value 3
1997Q1 100 130 120
1997Q1 100 130 124
1997Q1 120 136 154
1997Q2 180 145 154
1997Q2 186 134 126
1997Q2 186 124 176
1997Q3 190 143 176
1997Q3 192 143 123
I would like to calculate the differences between consecutive values within the same date, for example the differences in the Value 1 column for 1997Q1, then for 1997Q2, and so on.
I would like these differences to be shown in new columns, so that the result would look something like this:
Date Value1 Value 2 Value 3 Diff Val 1 Diff Val 2 Diff Val 3
1997Q1 100 130 120 0 0 4
1997Q1 100 130 124 20 6 30
1997Q1 120 136 154 N/A N/A N/A
1997Q2 180 145 154 6 -11 -28
1997Q2 186 134 126 0 10 50
1997Q2 186 124 176 N/A N/A N/A
1997Q3 190 143 176 2 0 -53
1997Q3 192 143 123
You can use dplyr functions for this. The lambda ~ .x - lead(.x) is applied to every value column selected with starts_with("Value"); within each Date group it takes the current value minus the next value. If you need lag instead, switch it around: ~ lag(.x) - .x
library(dplyr)
df1 %>%
  group_by(Date) %>%
  mutate(across(starts_with("Value"), ~.x - lead(.x), .names = "diff_{.col}"))
If the values are numeric and the column names cannot easily be matched by a pattern, you can use mutate(across(where(is.numeric), ~.x - lead(.x), .names = "diff_{.col}")).
# A tibble: 8 × 7
# Groups: Date [3]
Date Value1 Value2 Value3 diff_Value1 diff_Value2 diff_Value3
<chr> <int> <int> <int> <int> <int> <int>
1 1997Q1 100 130 120 0 0 -4
2 1997Q1 100 130 124 -20 -6 -30
3 1997Q1 120 136 154 NA NA NA
4 1997Q2 180 145 154 -6 11 28
5 1997Q2 186 134 126 0 10 -50
6 1997Q2 186 124 176 NA NA NA
7 1997Q3 190 143 176 -2 0 53
8 1997Q3 192 143 123 NA NA NA
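If dplyr is not available, the same "current minus next, within each Date" calculation can be sketched in base R with ave(). This is only an illustration under the assumption that the value columns all start with "Value"; value_cols and the diff_ prefix are names chosen here to mirror the output above.

```r
df1 <- data.frame(
  Date   = c("1997Q1", "1997Q1", "1997Q1", "1997Q2",
             "1997Q2", "1997Q2", "1997Q3", "1997Q3"),
  Value1 = c(100L, 100L, 120L, 180L, 186L, 186L, 190L, 192L),
  Value2 = c(130L, 130L, 136L, 145L, 134L, 124L, 143L, 143L),
  Value3 = c(120L, 124L, 154L, 154L, 126L, 176L, 176L, 123L),
  stringsAsFactors = FALSE
)

# Columns to difference (assumption: they all start with "Value")
value_cols <- grep("^Value", names(df1), value = TRUE)

for (col in value_cols) {
  # ave() applies FUN to the values of each Date group and puts the results
  # back in the original row order; c(v[-1], NA) is the "next value" vector
  df1[[paste0("diff_", col)]] <- ave(
    df1[[col]], df1$Date,
    FUN = function(v) v - c(v[-1], NA)
  )
}
```

The last row of each Date group has no next value, so its difference is NA, matching the dplyr output.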
data:
df1 <- structure(list(Date = c("1997Q1", "1997Q1", "1997Q1", "1997Q2",
"1997Q2", "1997Q2", "1997Q3", "1997Q3"), Value1 = c(100L, 100L,
120L, 180L, 186L, 186L, 190L, 192L), Value2 = c(130L, 130L, 136L,
145L, 134L, 124L, 143L, 143L), Value3 = c(120L, 124L, 154L, 154L,
126L, 176L, 176L, 123L)), class = "data.frame", row.names = c(NA,
-8L))

Lapply incorrect number of dimensions in a list of data frames..?

I have a large list of several hundred data frames, and I am trying to keep the rows that lie between two rows whose column Z contains the patterns VALUE1 and VALUE2. Like this:
weight | height | Z
---------------------------
62 100 NA
65 89 NA
59 88 randomnumbersVALUE1randomtext
66 92 NA
64 90 NA
64 87 randomnumbersVALUE2randomtext
57 84 NA
68 99 NA
59 82 NA
60 87 srebmunmodnarVALUE1txetmodnar
61 86 NA
63 84 srebmunmodnarVALUE2txetmodnar
And after filtering I would get:
59 88 randomnumbersVALUE1randomtext
66 92 NA
64 90 NA
64 87 randomnumbersVALUE2randomtext
60 87 srebmunmodnarVALUE1txetmodnar
61 86 NA
63 84 srebmunmodnarVALUE2txetmodnar
The code I'm using is:
lapply(df, function(x){
  start <- which(grepl("VALUE1", x$Z))
  end <- which(grepl("VALUE2", x$Z))
  rows <- unlist(lapply(seq_along(start), function(y){start[y]:end[y]}))
  return(df[rows,])})
But whenever I try to run the script, I get an error message saying:
Error in df[rows, ] : incorrect number of dimensions
Why does this happen and how can I get around it..?
EDIT: Added a minimal sample data of the actual datasheet (the first data frame and first element of the list, VALUE2 will follow VALUE 1 always at some point)
> head(tbl[[1]])
# A tibble: 6 × 4
t speed off Z
<dbl> <dbl> <dbl> <chr>
1 27.3 27.8 0.485 "{\"type\":\"M\",\"msg\":\"VALUE1\",\"time\":27.2498,\"dist\":0.410454}"
2 27.4 27.8 0.457 NA
3 27.5 27.8 0.430 NA
4 27.6 27.8 0.402 NA
5 27.7 27.8 0.374 NA
6 27.8 27.8 0.347 NA
Assuming there are equal numbers of 'VALUE1' and 'VALUE2' matches, get the position indexes of 'VALUE1' and 'VALUE2' separately with grep, build a sequence (:) for each corresponding pair of positions with Map, unlist the result, and use it to subset the data
df1[sort(unique(unlist(Map(`:`, grep("VALUE1", df1$Z),
grep("VALUE2", df1$Z))))),]
-output
weight height Z
3 59 88 randomnumbersVALUE1randomtext
4 66 92 <NA>
5 64 90 <NA>
6 64 87 randomnumbersVALUE2randomtext
10 60 87 srebmunmodnarVALUE1txetmodnar
11 61 86 <NA>
12 63 84 srebmunmodnarVALUE2txetmodnar
If df is a single data.frame, lapply loops over its columns, so each list element x is a vector (the corresponding column) and there is no x$Z.
If it is a list, the error can occur when an element has no 'VALUE1' or 'VALUE2' match, or when the number of 'VALUE1' matches differs from the number of 'VALUE2' matches. It may be better to check those elements before applying the : operator
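A sketch of how this could be applied to a whole list while guarding against unmatched markers. Note also that the function in the question returns df[rows,], which indexes the list itself rather than the current element x; subsetting a plain list with two indices is what raises "incorrect number of dimensions". The helper filter_between and the list tbl_list are hypothetical names used only for this illustration.

```r
df1 <- data.frame(
  weight = c(62L, 65L, 59L, 66L, 64L, 64L, 57L, 68L, 59L, 60L, 61L, 63L),
  height = c(100L, 89L, 88L, 92L, 90L, 87L, 84L, 99L, 82L, 87L, 86L, 84L),
  Z = c(NA, NA, "randomnumbersVALUE1randomtext", NA, NA,
        "randomnumbersVALUE2randomtext", NA, NA, NA,
        "srebmunmodnarVALUE1txetmodnar", NA,
        "srebmunmodnarVALUE2txetmodnar"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows between each VALUE1/VALUE2 pair in one data frame
filter_between <- function(x, start_pat = "VALUE1", end_pat = "VALUE2") {
  start <- grep(start_pat, x$Z)
  end   <- grep(end_pat, x$Z)
  # Guard: no markers, unequal counts, or an end before its start -> no rows
  if (length(start) == 0L || length(start) != length(end) || any(end < start)) {
    return(x[0, , drop = FALSE])
  }
  rows <- sort(unique(unlist(Map(`:`, start, end))))
  x[rows, , drop = FALSE]   # index the element x, not the enclosing list
}

# Illustrative list: the second element has a VALUE1 without a matching VALUE2
tbl_list <- list(first = df1, second = df1[1:5, ])
res <- lapply(tbl_list, filter_between)
```

Returning a zero-row data frame for problematic elements is one possible design choice; stopping with an informative error would be an equally reasonable alternative.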
data
df1 <- structure(list(weight = c(62L, 65L, 59L, 66L, 64L, 64L, 57L,
68L, 59L, 60L, 61L, 63L), height = c(100L, 89L, 88L, 92L, 90L,
87L, 84L, 99L, 82L, 87L, 86L, 84L), Z = c(NA, NA,
"randomnumbersVALUE1randomtext",
NA, NA, "randomnumbersVALUE2randomtext", NA, NA, NA,
"srebmunmodnarVALUE1txetmodnar",
NA, "srebmunmodnarVALUE2txetmodnar")),
class = "data.frame", row.names = c(NA,
-12L))

How can I delete "a lot" of rows from a dataframe in r

I tried all the similar posts, but none of the answers seemed to work for me. I want to delete 8500+ rows (by row name only) from a dataframe with 27,000+ rows. The other columns are completely different, but the smaller dataset was derived from the larger one, and spot-checking names shows that whatever I look for from the smaller df is present in the larger df. I could of course do this manually (busy work for sure!), but it seems like there should be a simple computational answer.
I have tried:
fordel<-df2[1,]
df3<-df1[!rownames(df1) %in% fordel
l1<- as.vector(df2[1,])
df3<- df1[1-c(l1),]
and lots of other crazy ideas!
Here is a smallish example: df1:
Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
ENSMUSG00000000001.4 10634 6954 6835 6510
ENSMUSG00000000003.15 0 0 0 0
ENSMUSG00000000028.14 559 1570 807 1171
ENSMUSG00000000031.15 5748 174 4103 146
ENSMUSG00000000037.16 37 194 49 96
ENSMUSG00000000049.11 0 3 1 0
ENSMUSG00000000056.7 1157 1125 806 947
ENSMUSG00000000058.6 75 304 123 169
ENSMUSG00000000078.6 4012 4391 5637 3854
ENSMUSG00000000085.16 381 560 482 368
ENSMUSG00000000088.6 2667 4777 3483 3450
ENSMUSG00000000093.6 3 48 41 22
ENSMUSG00000000094.12 23 201 102 192
df2
structure(list(base_mean = c(7962.408875, 947.1240794, 43.76698418 ), log2foldchange = c(-0.363434063, -0.137403759, -0.236463207 ), lfcSE = c(0.096816743, 0.059823215, 0.404929452), stat = c(-3.753834854, -2.296830066, -0.583961493)), row.names = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7", "ENSMUSG00000000093.6"), class = "data.frame")
I want to delete from df1 the rows corresponding to the rownames in df2.
Tried to format it, but seems no longer formatted... oh well....
Suggestions really appreciated!
You mentioned row names but your data does not include that, so I'll assume that they really don't matter (or exist). Also, your df2 has more column headers than columns, not sure what's going on there ... so I'll ignore it.
Data
df1 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15",
"ENSMUSG00000000028.14", "ENSMUSG00000000031.15", "ENSMUSG00000000037.16",
"ENSMUSG00000000049.11", "ENSMUSG00000000056.7", "ENSMUSG00000000058.6",
"ENSMUSG00000000078.6", "ENSMUSG00000000085.16", "ENSMUSG00000000088.6",
"ENSMUSG00000000093.6", "ENSMUSG00000000094.12"), clone57_RNA = c(10634L,
0L, 559L, 5748L, 37L, 0L, 1157L, 75L, 4012L, 381L, 2667L, 3L,
23L), clone43_RNA_2 = c(6954L, 0L, 1570L, 174L, 194L, 3L, 1125L,
304L, 4391L, 560L, 4777L, 48L, 201L), clone67_RNA = c(6835L,
0L, 807L, 4103L, 49L, 1L, 806L, 123L, 5637L, 482L, 3483L, 41L,
102L), clone55_RNA = c(6510L, 0L, 1171L, 146L, 96L, 0L, 947L,
169L, 3854L, 368L, 3450L, 22L, 192L)), class = "data.frame", row.names = c(NA,
-13L))
df2 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7",
"ENSMUSG00000000093.6"), base_mean = c(7962.408875, 947.1240794,
43.76698418), log2foldchange = c(-0.36343406, -0.137403759, -0.236463207
), pvalue = c(0.00017415, 0.021628466, 0.55924622)), class = "data.frame", row.names = c(NA,
-3L))
Base
df1[!df1$Ent_gene_id %in% df2$Ent_gene_id,]
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 2 ENSMUSG00000000003.15 0 0 0 0
# 3 ENSMUSG00000000028.14 559 1570 807 1171
# 4 ENSMUSG00000000031.15 5748 174 4103 146
# 5 ENSMUSG00000000037.16 37 194 49 96
# 6 ENSMUSG00000000049.11 0 3 1 0
# 8 ENSMUSG00000000058.6 75 304 123 169
# 9 ENSMUSG00000000078.6 4012 4391 5637 3854
# 10 ENSMUSG00000000085.16 381 560 482 368
# 11 ENSMUSG00000000088.6 2667 4777 3483 3450
# 13 ENSMUSG00000000094.12 23 201 102 192
dplyr
dplyr::anti_join(df1, df2, by = "Ent_gene_id")
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 1 ENSMUSG00000000003.15 0 0 0 0
# 2 ENSMUSG00000000028.14 559 1570 807 1171
# 3 ENSMUSG00000000031.15 5748 174 4103 146
# 4 ENSMUSG00000000037.16 37 194 49 96
# 5 ENSMUSG00000000049.11 0 3 1 0
# 6 ENSMUSG00000000058.6 75 304 123 169
# 7 ENSMUSG00000000078.6 4012 4391 5637 3854
# 8 ENSMUSG00000000085.16 381 560 482 368
# 9 ENSMUSG00000000088.6 2667 4777 3483 3450
# 10 ENSMUSG00000000094.12 23 201 102 192
Edit: same thing but with row names:
# update my df1 to change Ent_gene_id from a column to rownames
rownames(df1) <- df1$Ent_gene_id
df1$Ent_gene_id <- NULL
# use your updated df2 (from dput)
# df2 <- structure(...)
df1[ !rownames(df1) %in% rownames(df2), ]
# clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# ENSMUSG00000000003.15 0 0 0 0
# ENSMUSG00000000028.14 559 1570 807 1171
# ENSMUSG00000000031.15 5748 174 4103 146
# ENSMUSG00000000037.16 37 194 49 96
# ENSMUSG00000000049.11 0 3 1 0
# ENSMUSG00000000058.6 75 304 123 169
# ENSMUSG00000000078.6 4012 4391 5637 3854
# ENSMUSG00000000085.16 381 560 482 368
# ENSMUSG00000000088.6 2667 4777 3483 3450
# ENSMUSG00000000094.12 23 201 102 192
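An equivalent way to express the row-name version is with setdiff(), sketched below on a cut-down version of the data (only two columns and four genes, chosen here for brevity). It also shows why the attempt in the question does not work: df2[1,] is the first row of df2, that is, a one-row data frame of values, not the row names.

```r
# Reduced illustration data; row names carry the gene ids
df1 <- data.frame(
  clone57_RNA   = c(10634L, 0L, 559L, 1157L),
  clone43_RNA_2 = c(6954L, 0L, 1570L, 1125L),
  row.names = c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15",
                "ENSMUSG00000000028.14", "ENSMUSG00000000056.7")
)
df2 <- data.frame(
  base_mean = c(7962.408875, 947.1240794),
  row.names = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7")
)

# Row names present in df1 but not in df2, in df1's original order
keep <- setdiff(rownames(df1), rownames(df2))
df3  <- df1[keep, , drop = FALSE]

# The question's attempt compares row names against the values of df2's first row
fordel <- df2[1, ]
any(rownames(df1) %in% fordel)   # FALSE: nothing would be removed
```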

R- create new columns based on levels of a freq table variable

Hi I am new to R so please bear with me,
I have my data arranged like so,
Length Seq X
28 GTGCACCGCAAGTGCTTCTAAGAAGGAT 19
28 TGCACCGCAAGTGCTTCTAAGAAGGATC 18
29 GTGCACCGCAAGTGCTTCTAAGAAGGATC 19
29 GTGCACCGCAAGTGCTTCTAAGAAGGATC 19
and I used
count(dF, vars=c("Length", "X"))
to generate a freq table that looks like:
Length X freq
28 15 160
28 16 163
28 17 21
29 15 198
29 16 410
29 17 104
How can I rearrange the data so that it looks something like this?
Length 15 16 17 total
28 160 163 21 344
29 198 410 104 712
30 205 614 393 1212
Tot 2746 6564 2012 11322
(I know these values are wrong)
If you want it to look like your example:
# your data
df<- data.frame(Length = c(28, 28, 28, 29, 29, 29),
X = c(15, 16, 17, 15, 16, 17),
freq = c(160, 163, 21, 198, 410, 104))
use this function
require(reshape)
tabler <- function(a){
  b <- cast(a, Length~X)
  b <- cbind(b, rowSums(b))
  b <- rbind(b, colSums(b))
  colnames(b)[ncol(b)] <- b[nrow(b),1] <- "total"
  return(b)
}
tabler(df)
returns:
Length 15 16 17 total
1 28 160 163 21 344
2 29 198 410 104 712
3 total 358 573 125 1056
A base R option is
addmargins(xtabs(freq~Length+X, df1))
# X
#Length 15 16 17 Sum
# 28 160 163 21 344
# 29 198 410 104 712
# Sum 358 573 125 1056
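If the raw (uncounted) data is still at hand, the intermediate count() step can be skipped, since table() does the counting itself. A small sketch using the four rows shown in the question (the resulting counts are therefore tiny and only illustrate the shape of the result):

```r
# The four raw rows from the question
dF <- data.frame(
  Length = c(28L, 28L, 29L, 29L),
  Seq = c("GTGCACCGCAAGTGCTTCTAAGAAGGAT", "TGCACCGCAAGTGCTTCTAAGAAGGATC",
          "GTGCACCGCAAGTGCTTCTAAGAAGGATC", "GTGCACCGCAAGTGCTTCTAAGAAGGATC"),
  X = c(19L, 18L, 19L, 19L),
  stringsAsFactors = FALSE
)

# Cross-tabulate Length by X, then append row and column totals ("Sum")
tab <- addmargins(table(Length = dF$Length, X = dF$X))
tab
```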
data
df1 <- structure(list(Length = c(28L, 28L, 28L, 29L, 29L, 29L),
X = c(15L,
16L, 17L, 15L, 16L, 17L), freq = c(160L, 163L, 21L, 198L, 410L,
104L)), .Names = c("Length", "X", "freq"), class = "data.frame",
row.names = c(NA, -6L))

Grouping the dataframe based on one variable

I have a dataframe with 10 variables, all of them numeric, and one of the variables is named age. I want to group the observations based on age, for example ages 17 to 18 in one group and 19 to 22 in another, with each row assigned to its group. The result should be a dataframe for further manipulation.
Model of the dataframe:
A B AGE
25 50 17
30 42 22
50 60 19
65 105 17
355 400 21
68 47 20
115 98 18
25 75 19
And I want result like
17-18
A B AGE
25 50 17
65 105 17
115 98 18
19-22
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
I grouped the dataset according to the Age variable using the split function; now my concern is how I can manipulate the grouped data. E.g. the result looked like
$1
A B AGE
25 50 17
65 105 17
115 98 18
$2
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
My question is how can I access each group for further manipulation?
For example, what if I want to do a t-test for each group separately?
The split function will work with dataframes. Use either cut with 'breaks' or findInterval with an appropriate set of cutpoints (named 'vec' if you are using named parameters) as the criterion for grouping, the second argument to split. The default for cut is intervals closed on the right and default for findInterval is closed on the left.
> split(dat, findInterval(dat$AGE, c(17, 19.5, 22.5)))
$`1`
A B AGE
1 25 50 17
3 50 60 19
4 65 105 17
7 115 98 18
8 25 75 19
$`2`
A B AGE
2 30 42 22
5 355 400 21
6 68 47 20
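The difference between the two defaults only shows up for values that fall exactly on a break. A small illustration with the same break vector passed to both functions (the vectors ages and breaks are made up for this example):

```r
ages   <- c(17, 18, 19, 22)
breaks <- c(16, 18, 22)

# cut(): intervals (16,18] and (18,22] are closed on the right
cut_grp <- cut(ages, breaks = breaks, labels = FALSE)
cut_grp   # 1 1 2 2

# findInterval(): intervals [16,18) and [18,22) are closed on the left;
# a value at or above the last break gets the next index
fi_grp <- findInterval(ages, breaks)
fi_grp    # 1 2 2 3
```

This is why the cut points need to be chosen differently for the two functions to produce the same grouping.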
Here is the approach with cut
lst <- split(df1, cut(df1$AGE, breaks=c(16, 18, 22), labels=FALSE))
lst
# $`1`
# A B AGE
#1 25 50 17
#4 65 105 17
#7 115 98 18
#$`2`
# A B AGE
#2 30 42 22
#3 50 60 19
#5 355 400 21
#6 68 47 20
#8 25 75 19
Update
If you need to find the sum, mean of columns for each "list" element
lapply(lst, function(x) rbind(colSums(x[-3]),colMeans(x[-3])))
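But if the objective is to find summary statistics by group, it can be done using any of the aggregating functions (note that summarise_each() and funs() are deprecated in current dplyr releases in favour of summarise() with across())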
library(dplyr)
df1 %>%
  group_by(grp=cut(AGE, breaks=c(16, 18, 22), labels=FALSE)) %>%
  summarise_each(funs(sum=sum(., na.rm=TRUE),
                      mean=mean(., na.rm=TRUE)), A:B)
# grp A_sum B_sum A_mean B_mean
#1 1 205 253 68.33333 84.33333
#2 2 528 624 105.60000 124.80000
Or using aggregate from base R
do.call(data.frame,
aggregate(cbind(A,B)~cbind(grp=cut(AGE, breaks=c(16, 18, 22),
labels=FALSE)), df1, function(x) c(sum=sum(x), mean=mean(x))))
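For the t-test part of the question, each element of the split list is an ordinary data frame, so a test can be run per group with lapply. The question does not say which comparison is wanted; the sketch below assumes a Welch two-sample t-test of column A against column B within each age group, purely as an illustration.

```r
df1 <- data.frame(
  A   = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L),
  B   = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L),
  AGE = c(17L, 22L, 19L, 17L, 21L, 20L, 18L, 19L)
)

# Same grouping as above: (16,18] -> group 1, (18,22] -> group 2
lst <- split(df1, cut(df1$AGE, breaks = c(16, 18, 22), labels = FALSE))

# One t.test() result (class "htest") per group; A vs B is an assumed comparison
tests <- lapply(lst, function(g) t.test(g$A, g$B))

# Pull a single statistic out of every result, e.g. the p-value
p_values <- sapply(tests, function(tt) tt$p.value)
```

An individual group remains accessible by name or position, for example lst[["1"]] or lst[[2]], and the corresponding test result as tests[["1"]].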
data
df1 <- structure(list(A = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L
), B = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L), AGE = c(17L,
22L, 19L, 17L, 21L, 20L, 18L, 19L)), .Names = c("A", "B", "AGE"
), class = "data.frame", row.names = c(NA, -8L))
