Reshaping data in R to a singular matrix - r

It may be a little hard to describe exactly what I want, but I will try my best.
Say this is my data in R:
R1 R2 R3 R4
a  b  a  a
b  d  c  b
         e
I want to reshape the data frame into a kind of singular matrix form, like this:
a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1
I assume this is straightforward, but my limited knowledge of R is making it a hassle.
Thanks for your time.

What about this?
un <- sort(unique(c(as.matrix(df))))
res <- apply(df, 2, function(x) un %in% x)
rownames(res) <- un
res[] <- as.numeric(res)
t(res)
a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1

The following uses the ldply function from the plyr package, which transforms a list into a data.frame.
library(plyr)
data_as_list <- list(R1=c('a', 'b'), R2=c('b', 'd'), R3=c('a', 'c'), R4=c('a', 'b', 'e'))
# For each list item, test which of the letters a-e it contains; * 1 turns TRUE/FALSE into 1/0
result <- ldply(data_as_list, function(item) {
  sapply(letters[1:5], function(letter) letter %in% item) * 1})
Given a list of character vectors, we generate one row of the resulting data.frame from each item in the list by asking whether each of the first 5 letters (a-e) appears in the vector (item). Multiplying by 1 is a hack to convert a logical vector to a vector of 1s and 0s, if that's really what you want.
Results:
.id a b c d e
1 R1 1 1 0 0 0
2 R2 0 1 0 1 0
3 R3 1 0 1 0 0
4 R4 1 1 0 0 1
To fix up the row names:
rownames(result) <- result$.id
result <- result[, -which(colnames(result)=='.id')]
Now you have:
a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1
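The same idea can be sketched in base R, without plyr and without hard-coding letters[1:5]. This is only an illustration: the names values and indicator are made up here, and the column set is derived from the data instead.

```r
data_as_list <- list(R1 = c("a", "b"), R2 = c("b", "d"),
                     R3 = c("a", "c"), R4 = c("a", "b", "e"))

# Derive the column set from the data rather than assuming the letters a-e
values <- sort(unique(unlist(data_as_list)))

# vapply returns one column per list item, so transpose to get one row per item
indicator <- t(vapply(data_as_list,
                      function(item) as.integer(values %in% item),
                      integer(length(values))))
colnames(indicator) <- values
indicator
```

The result is an integer matrix with row names R1 to R4, so no row-name fix-up is needed afterwards.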

Base R solution:
data_as_list = list(R1=c('a', 'b'), R2=c('b', 'd'), R3=c('a', 'c'), R4=c('a', 'b', 'e'))
stack(data_as_list)
#-----------
values ind
1 a R1
2 b R1
3 b R2
4 d R2
5 a R3
6 c R3
7 a R4
8 b R4
9 e R4
#---------
xtabs( ~ values+ind, data=stack(data_as_list) )
#-----------
ind
values R1 R2 R3 R4
a 1 0 1 1
b 1 1 0 1
c 0 0 1 0
d 0 1 0 0
e 0 0 0 1
xtabs( ~ ind+values, data=stack(data_as_list) )
#----------
values
ind a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1

Another approach is to use mtabulate from the "qdapTools" package. This will work for either a data.frame or a list... which should make sense, of course :-)
library(qdapTools)
x <- mtabulate(df)
x[] <- as.numeric(x > 0)
x
# V1 a b d c e
# R1 1 1 1 0 0 0
# R2 0 0 1 1 0 0
# R3 1 1 0 0 1 0
# R4 0 1 1 0 0 1
Since there are two "d" values in "R2", we use the as.numeric(x > 0) to convert to just ones and zeroes. You can drop the first column, which has counted the blanks.
I've used the sample data provided by @barerd:
df <- structure(list(R1 = structure(c(2L, 3L, 1L), .Label = c("", "a",
"b"), class = "factor"), R2 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), R3 = structure(c(2L, 3L, 1L), .Label = c("",
"a", "c"), class = "factor"), R4 = structure(1:3, .Label = c("a",
"b", "e"), class = "factor")), .Names = c("R1", "R2", "R3", "R4"
), row.names = c(NA, -3L), class = "data.frame")
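If installing a package is not an option, a rough base R equivalent of the count-then-threshold step could look like the sketch below. It assumes the columns are character rather than factor (stack() skips factor columns), and the names long, counts and indicator are only illustrative.

```r
# Same values as the sample data above, but stored as character columns
df <- data.frame(R1 = c("a", "b", ""), R2 = c("d", "d", "b"),
                 R3 = c("a", "c", ""), R4 = c("a", "b", "e"),
                 stringsAsFactors = FALSE)

long <- stack(df)                     # long format: columns 'values' and 'ind'
long <- long[long$values != "", ]     # drop the blank cells up front
counts <- table(long$ind, long$values)  # R2 / d is counted twice here
indicator <- (counts > 0) * 1         # collapse counts to 0/1
indicator
```

Dropping the blanks before tabulating means there is no extra column to remove afterwards.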

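Here is a possibility. This could be improved to scale better.
# Assumes R1..R4 are character vectors (e.g. R1 <- c("a", "b")) and
# that ae holds the candidate values:
ae <- letters[1:5]
matrix(as.numeric(rbind(ae %in% R1,
                        ae %in% R2,
                        ae %in% R3,
                        ae %in% R4)), 4, 5)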

x1<-as.character(grep("[a-z]",unique(unlist(df)),value=TRUE)) #df is data
x2<-data.frame(do.call(rbind,lapply(1:ncol(df),function(i){ifelse(x1 %in% df[,i],1,0)})))
colnames(x2)<-x1
row.names(x2)<-names(df)
x2
a b d c e
R1 1 1 0 0 0
R2 0 1 1 0 0
R3 1 0 0 1 0
R4 1 1 0 0 1

First of all, I suppose this is data from a csv file or a table, which you can read into R with read.table() or read.csv().
You should share it as dput() output, like:
structure(list(R1 = structure(c(2L, 3L, 1L), .Label = c("", "a",
"b"), class = "factor"), R2 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), R3 = structure(c(2L, 3L, 1L), .Label = c("",
"a", "c"), class = "factor"), R4 = structure(1:3, .Label = c("a",
"b", "e"), class = "factor")), .Names = c("R1", "R2", "R3", "R4"
), row.names = c(NA, -3L), class = "data.frame")
so that we can load it into R easily.
You can reshape your data with the "reshape" package. There are many documents on reshaping data in R, including the help pages, but basically you can transpose your data with t(), so that columns become rows. You can then melt() it, so that each row becomes a unique id-variable combination, like:
X1 X2 value
1 R1 1 a
2 R2 1 d
3 R3 1 a
4 R4 1 a
5 R1 2 b
6 R2 2 d
7 R3 2 c
8 R4 2 b
9 R1 3
10 R2 3 b
11 R3 3
12 R4 3 e
and then you can cast(data, formula, function) the melted data into any shape. Since you wanted to see the distribution of the values across the R* columns, I used the following call:
t(cast(melt(t(t), id=c("a", "b", "c", "d", "e")), value~X1, ))[, c(2:6)]
and got:
a b c d e
R1 1 1 0 0 0
R2 0 1 0 2 0
R3 1 0 1 0 0
R4 1 1 0 0 1

Related

Add new columns to a dataframe containing sum of positive values in a row and sum of negative values in a row - R

I have a dataframe df which looks like this
ID A B C D E F G
1 0 0 1 -1 1 0 0
2 1 1 1 0 0 0 0
3 -1 0 1 0 -1 -1 0
.
.
.
I want to add two columns at the end of each row showing the sum of positive values and the sum of negative values, so df would look like this:
ID A B C D E F G pos neg
1 0 0 1 -1 1 0 0 2 -1
2 1 1 1 0 0 0 0 3 0
3 -1 0 1 0 -1 -1 0 1 -3
.
.
.
I can't figure out how to do this. I have tried the following, which turns the df into a list:
df$neg <- rowSums(df < 0)
I have also tried the following:
df$neg <- rowSums(df[, c("A", "B", "C", "D", "E", "F", "G")] < 0)
which throws an error message:
Error in df[, c("A", "B", "C", :
subscript out of bounds
Any help would be really appreciated, thanks!
We can try this
cbind(
df,
pos = rowSums(df[-1] * (df[-1] > 0)),
neg = rowSums(df[-1] * (df[-1] < 0))
)
which gives
ID A B C D E F G pos neg
1 1 0 0 1 -1 1 0 0 2 -1
2 2 1 1 1 0 0 0 0 3 0
3 3 -1 0 1 0 -1 -1 0 1 -3
Data
> dput(df)
structure(list(ID = 1:3, A = c(0L, 1L, -1L), B = c(0L, 1L, 0L
), C = c(1L, 1L, 1L), D = c(-1L, 0L, 0L), E = 1:-1, F = c(0L,
0L, -1L), G = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Using dplyr:
library(dplyr)
df %>%
  mutate(pos = rowSums(replace(.[-1], .[-1] < 0, 0)),
         neg = rowSums(replace(.[-1], .[-1] > 0, 0)))
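One more base R variant, as a sketch rather than a definitive answer. A comparison like df < 0 yields TRUE/FALSE, so rowSums over it counts negative cells instead of adding them up. Clamping the values with pmax() and pmin() keeps the numbers themselves. The helper name m is just for illustration.

```r
df <- data.frame(ID = 1:3,
                 A = c(0, 1, -1), B = c(0, 1, 0), C = c(1, 1, 1),
                 D = c(-1, 0, 0), E = c(1, 0, -1), F = c(0, 0, -1),
                 G = c(0, 0, 0))

m <- as.matrix(df[-1])          # numeric part only, ID column excluded
df$pos <- rowSums(pmax(m, 0))   # negatives clamped to 0, so only positives add up
df$neg <- rowSums(pmin(m, 0))   # positives clamped to 0, so only negatives add up
df
```

Taking m before adding the new columns ensures pos is not accidentally included when neg is computed.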

Subtracting multiple rows from the same row in R

I am looking to subtract multiple rows from the same row within a dataframe.
For example:
Group A B C
A 3 1 2
B 4 0 3
C 4 1 1
D 2 1 2
This is what I want it to look like:
Group A B C
B 1 -1 1
C 1 0 -1
D -1 0 0
So in other words:
Row B - Row A
Row C - Row A
Row D - Row A
Thank you!
Here's a dplyr solution:
library(dplyr)
df %>%
mutate(across(A:C, ~ . - .[1])) %>%
filter(Group != "A")
This gives us:
Group A B C
1: B 1 -1 1
2: C 1 0 -1
3: D -1 0 0
Here's an approach with base R:
data[-1] <- do.call(rbind,
apply(data[-1],1,function(x) x - data[1,-1])
)
data[-1,]
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
Data:
data <- structure(list(Group = c("A", "B", "C", "D"), A = c(3L, 4L, 4L,
2L), B = c(1L, 0L, 1L, 1L), C = c(2L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
We could also replicate the first row and subtract it from the rest:
cbind(data[-1, 1, drop = FALSE], data[-1, -1] - data[1, -1][col(data[-1, -1])])
Output:
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
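For completeness, the "Row B - Row A, Row C - Row A, Row D - Row A" pattern can also be written with sweep(), which subtracts a vector from every row by default. A minimal sketch, with num, diffs and result as illustrative names:

```r
data <- data.frame(Group = c("A", "B", "C", "D"),
                   A = c(3, 4, 4, 2), B = c(1, 0, 1, 1), C = c(2, 3, 1, 2))

num <- as.matrix(data[-1])                  # numeric columns only
# Subtract row 1 from every remaining row (MARGIN = 2 matches by column; FUN defaults to "-")
diffs <- sweep(num[-1, , drop = FALSE], 2, num[1, ])
result <- data.frame(Group = data$Group[-1], diffs)
result
```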

Merge multiple columns in one table?

I have a table with several columns. I would like to create a new column by combining the 'R1', 'R2' and 'R3' columns.
DF:
ID R1 T1 R2 T2 R3 T3
rs1 A 1 NA . NA 0
rs21 NA 0 C 1 C 1
rs32 A 1 A 1 A 0
rs25 NA 2 NA 0 A 0
Desired output:
ID R1 T1 R2 T2 R3 T3 New_R
rs1 A 1 NA . NA 0 A
rs21 NA 0 C 1 C 1 C
rs32 A 1 A 1 A 0 A
rs25 NA 2 NA 0 A 0 A
We can use tidyverse
library(tidyverse)
DF %>%
  mutate(New_R = pmap_chr(select(., starts_with("R")), ~ c(...) %>%
    na.omit %>%
    unique %>%
    str_c(collapse = "")))
#. ID R1 T1 R2 T2 R3 T3 New_R
#1 rs1 A 1 <NA> . <NA> 0 A
#2 rs21 <NA> 0 C 1 C 1 C
#3 rs32 A 1 A 1 A 0 A
#4 rs25 <NA> 2 <NA> 0 A 0 A
If there is only one non-NA element per row, we can use coalesce
DF %>%
mutate(New_R = coalesce(!!! select(., starts_with("R"))))
Or in base R
DF$New_R <- do.call(pmin, c(DF[grep("^R\\d+", names(DF))], na.rm = TRUE))
data
DF <- structure(list(ID = c("rs1", "rs21", "rs32", "rs25"), R1 = c("A",
NA, "A", NA), T1 = c(1L, 0L, 1L, 2L), R2 = c(NA, "C", "A", NA
), T2 = c(".", "1", "1", "0"), R3 = c(NA, "C", "A", "A"), T3 = c(0L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -4L))
You can use the ifelse function in a nested way:
DF$New_R <- ifelse(!is.na(DF$R1), DF$R1,
ifelse(!is.na(DF$R2), DF$R2,
ifelse(!is.na(DF$R3), DF$R3, NA)))
ifelse takes three arguments: a condition, what to return if the condition is fulfilled, and what to return if it is not. It can be applied to a data frame column, treating each row separately. In my example it will pick the first non-NA value found.
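With more than three R columns the nesting gets unwieldy. One way to generalise the same "first non-NA" rule is to fold a two-argument ifelse over the columns with Reduce(). This is a sketch: first_non_na is a made-up helper name, and the column pattern assumes names of the form R followed by digits.

```r
DF <- data.frame(ID = c("rs1", "rs21", "rs32", "rs25"),
                 R1 = c("A", NA, "A", NA),
                 R2 = c(NA, "C", "A", NA),
                 R3 = c(NA, "C", "A", "A"),
                 stringsAsFactors = FALSE)

# Keep x where it is present, otherwise fall back to y
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over R1, R2, R3 from left to right
DF$New_R <- Reduce(first_non_na, DF[grep("^R[0-9]+$", names(DF))])
DF
```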
We can use apply row-wise, removing NA values and keeping only unique values.
cols <- paste0("R", 1:3)
df$New_R <- apply(df[cols], 1, function(x)
paste0(unique(na.omit(x)), collapse = ""))
df
# ID R1 T1 R2 T2 R3 T3 New_R
#1 rs1 A 1 <NA> . <NA> 0 A
#2 rs21 <NA> 0 C 1 C 1 C
#3 rs32 A 1 A 1 A 0 A
#4 rs25 <NA> 2 <NA> 0 A 0 A

R: how to do partial row sums

I am very new to R, and I sincerely appreciate your help.
The following is part of my data:
subjectID A B C D E F G H I J
S001 1 1 1 1 1 0 0
S002 1 1 1 0 0 0 0
I want to sum the rows from A to J, and so the data will look like this:
subjectID A B C D E F G H I J TOTAL
S001 1 1 1 1 1 0 0 5
S002 1 1 1 0 0 0 0 3
Thank you so much! I would like to sum only where variables A to J == 1.
As suggested, I post my answers here.
This one uses apply. The df[-1] excludes the first column (which is not numeric), and x[x == 1] subsets the elements of x (a single row, because of the 1 passed to apply) to only the values equal to 1.
df$TOTAL <- apply(df[-1], 1, function(x) sum(x[x == 1], na.rm = TRUE))
Another way in base R, easier to code (and, I bet, much faster), is:
df$TOTAL <- rowSums(df[-1] == 1, na.rm = TRUE)
Both give this result:
df
subjectID A B C D E F G H I J TOTAL
1 S001 1 1 1 1 1 0 0 NA NA NA 5
2 S002 1 1 1 0 0 0 0 NA NA NA 3
Data
df <- structure(list(subjectID = structure(1:2, .Label = c("S001",
"S002"), class = "factor"), A = c(1L, 1L), B = c(1L, 1L), C = c(1L,
1L), D = c(1L, 0L), E = c(1L, 0L), F = c(0L, 0L), G = c(0L, 0L
), H = c(NA, NA), I = c(NA, NA), J = c(NA, NA)), .Names = c("subjectID",
"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), class = "data.frame", row.names = c(NA,
-2L))
Another option, similar to the one posted by SabDeM, but using sapply to sum only the numeric columns:
df$Total <- rowSums(df[, sapply(df, is.numeric)], na.rm = TRUE)
Output:
subjectID A B C D E F G H I J Total
1 S001 1 1 1 1 1 0 0 NA NA NA 5
2 S002 1 1 1 0 0 0 0 NA NA NA 3
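If the data frame has other numeric columns that should not be counted, selecting A to J by name keeps the sum "partial". A small sketch, assuming the columns really are named A through J (cols is an illustrative name):

```r
df <- data.frame(subjectID = c("S001", "S002"),
                 A = c(1, 1), B = c(1, 1), C = c(1, 1), D = c(1, 0),
                 E = c(1, 0), F = c(0, 0), G = c(0, 0),
                 H = NA, I = NA, J = NA)

cols <- LETTERS[1:10]                              # "A" ... "J"
# Count cells equal to 1 in the named columns only; NA cells are ignored
df$TOTAL <- rowSums(df[cols] == 1, na.rm = TRUE)
df
```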

Converting a large long data to wide in R [duplicate]

I need help converting my long data of dimension 1558810 x 84 to wide data of dimension 1558810 x 4784.
Let me explain in detail how and why. My raw data has three main columns:
id empId dept
1 a social
2 a Hist
3 a math
4 b comp
5 a social
6 b comp
7 c math
8 c Hist
9 b math
10 a comp
id is the unique key that tells which employee went to which department in a university on a day. I need this to be transformed as below.
id empId dept social Hist math comp
1 a social 1 0 0 0
2 a Hist 0 1 0 0
3 a math 0 0 1 0
4 b comp 0 0 0 1
5 a social 1 0 0 0
6 b comp 0 0 0 1
7 c math 0 0 1 0
8 c Hist 0 1 0 0
9 b math 0 0 1 0
10 a comp 0 0 0 1
I have two datasets, one with 49k rows and one with 1.55 million rows. For the smaller dataset, which has 1100 unique department values, I used dcast from the reshape2 package to get the desired dataset (so the transformed data has 3+1100 columns and 49k rows). But when I use the same function on my larger dataset, which has 4700 unique department values, R crashes because of a memory issue. I tried various alternatives such as xtabs and reshape, but every time it failed with a memory error.
I have now resorted to a crude for loop:
columns <- unique(ds$dept)
for(i in 1:length(unique(ds$dept)))
{
ds[,columns[i]] <- ifelse(ds$dept==columns[i],1,0)
}
But this is extremely slow; the code has been running for 10 hours now. Is there a workaround that I am missing?
Any suggestions would be of great help!
You could try
df$dept <- factor(df$dept, levels=unique(df$dept))
res <- cbind(df, model.matrix(~ 0+dept, df))
colnames(res) <- gsub("dept(?=[A-Za-z])", "", colnames(res), perl=TRUE)
res
# id empId dept social Hist math comp
#1 1 a social 1 0 0 0
#2 2 a Hist 0 1 0 0
#3 3 a math 0 0 1 0
#4 4 b comp 0 0 0 1
#5 5 a social 1 0 0 0
#6 6 b comp 0 0 0 1
#7 7 c math 0 0 1 0
#8 8 c Hist 0 1 0 0
#9 9 b math 0 0 1 0
#10 10 a comp 0 0 0 1
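Since the question reports memory errors, it may also be worth keeping the indicator columns in a sparse matrix, because each row contains exactly one 1. The sketch below uses sparseMatrix() from the Matrix package (a recommended package that normally ships with R). Whether it fits in memory for 1.55 million rows is an assumption I have not tested, and dept_f and ind are illustrative names.

```r
library(Matrix)

df <- data.frame(id = 1:10,
                 empId = c("a", "a", "a", "b", "a", "b", "c", "c", "b", "a"),
                 dept = c("social", "Hist", "math", "comp", "social",
                          "comp", "math", "Hist", "math", "comp"),
                 stringsAsFactors = FALSE)

dept_f <- factor(df$dept, levels = unique(df$dept))

# One nonzero entry per row: row i gets a 1 in the column of its department
ind <- sparseMatrix(i = seq_len(nrow(df)),
                    j = as.integer(dept_f),
                    x = 1,
                    dims = c(nrow(df), nlevels(dept_f)),
                    dimnames = list(NULL, levels(dept_f)))
ind
```

Only the nonzero cells are stored, so the size grows with the number of rows rather than rows times departments. The matrix is kept alongside df instead of being bound to it, since cbind with a data frame would make it dense again.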
Or you could try
cbind(df, as.data.frame.matrix(table(df[,c(1,3)])))
Or using data.table
library(data.table)
setDT(df)
dcast.data.table(df, id + empId + dept ~ dept, fun=length)
Or using qdap
library(qdap)
cbind(df, as.wfm(with(df, mtabulate(setNames(dept, id)))))
data
df <- structure(list(id = 1:10, empId = c("a", "a", "a", "b", "a",
"b", "c", "c", "b", "a"), dept = c("social", "Hist", "math",
"comp", "social", "comp", "math", "Hist", "math", "comp")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c(NA, -10L))
Try:
> library(reshape2)
> cbind(dd[1:3], dcast(dd, dd$id~dd$dept, length)[-1])
Using dept as value column: use value.var to override.
id empId dept comp Hist math social
1 1 a social 0 0 0 1
2 2 a Hist 0 1 0 0
3 3 a math 0 0 1 0
4 4 b comp 1 0 0 0
5 5 a social 0 0 0 1
6 6 b comp 1 0 0 0
7 7 c math 0 0 1 0
8 8 c Hist 0 1 0 0
9 9 b math 0 0 1 0
10 10 a comp 1 0 0 0
data:
> dput(dd)
structure(list(id = 1:10, empId = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 3L, 3L, 2L, 1L), .Label = c("a", "b", "c"), class = "factor"),
dept = structure(c(4L, 2L, 3L, 1L, 4L, 1L, 3L, 2L, 3L, 1L
), .Label = c("comp", "Hist", "math", "social"), class = "factor")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))

Resources