Converting a large long data to wide in R [duplicate] - r

I need help converting my long data, of dimension 1558810 x 84, to wide data of dimension 1558810 x 4784.
Let me explain in detail how and why. My raw data is as follows -
The data has three main columns -
id empId dept
1 a social
2 a Hist
3 a math
4 b comp
5 a social
6 b comp
7 c math
8 c Hist
9 b math
10 a comp
id is the unique key that tells which employee went to which department in a university on a day. I need this to be transformed as below.
id empId dept social Hist math comp
1 a social 1 0 0 0
2 a Hist 0 1 0 0
3 a math 0 0 1 0
4 b comp 0 0 0 1
5 a social 1 0 0 0
6 b comp 0 0 0 1
7 c math 0 0 1 0
8 c Hist 0 1 0 0
9 b math 0 0 1 0
10 a comp 0 0 0 1
I have two datasets, one with 49k rows and one with 1.55 million rows. For the smaller dataset, which had 1100 unique department values, I used dcast from the reshape2 package to get the desired dataset (so the transformed data has 3+1100 columns and 49k rows). But when I use the same function on my larger dataset, which has 4700 unique department values, R crashes because of a memory issue. I tried various other alternatives like xtabs, reshape, etc., but every time it failed with a memory error.
I have now resorted to a crude FOR loop for this purpose -
columns <- unique(ds$dept)
for(i in 1:length(unique(ds$dept)))
{
ds[,columns[i]] <- ifelse(ds$dept==columns[i],1,0)
}
But this is extremely slow; the code has been running for 10 hours now. Is there any workaround for this that I am missing?
Any suggestions would be of great help!

You could try
df$dept <- factor(df$dept, levels=unique(df$dept))
res <- cbind(df, model.matrix(~ 0+dept, df))
colnames(res) <- gsub("dept(?=[A-Za-z])", "", colnames(res), perl=TRUE)
res
# id empId dept social Hist math comp
#1 1 a social 1 0 0 0
#2 2 a Hist 0 1 0 0
#3 3 a math 0 0 1 0
#4 4 b comp 0 0 0 1
#5 5 a social 1 0 0 0
#6 6 b comp 0 0 0 1
#7 7 c math 0 0 1 0
#8 8 c Hist 0 1 0 0
#9 9 b math 0 0 1 0
#10 10 a comp 0 0 0 1
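A note on the memory problem in the question: a dense 1.55 million x 4700 indicator table is very large whichever function builds it, so one option worth considering is to keep the indicators in a sparse matrix instead of adding columns to the data frame. The sketch below is not one of the methods above; it assumes the Matrix package is available and uses the 10-row sample, and the name ind is only illustrative.

```r
library(Matrix)

# Sample data from the question (10 rows, 4 departments)
df <- data.frame(
  id    = 1:10,
  empId = c("a", "a", "a", "b", "a", "b", "c", "c", "b", "a"),
  dept  = c("social", "Hist", "math", "comp", "social",
            "comp", "math", "Hist", "math", "comp"),
  stringsAsFactors = FALSE
)

# Keep departments in order of first appearance, as in the desired output
dept_f <- factor(df$dept, levels = unique(df$dept))

# One non-zero entry per row: row k gets a 1 in the column of its department
ind <- sparseMatrix(
  i = seq_len(nrow(df)),
  j = as.integer(dept_f),
  x = 1,
  dims = c(nrow(df), nlevels(dept_f)),
  dimnames = list(as.character(df$id), levels(dept_f))
)
ind
```

Only the non-zero cells are stored, so the object grows with the number of rows rather than rows times departments. Whether downstream code can work with a sparse matrix instead of a data frame is something you would have to check for your own use case.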
Or you could try
cbind(df, as.data.frame.matrix(table(df[,c(1,3)])))
Or using data.table
library(data.table)
setDT(df)
dcast.data.table(df, id + empId + dept ~ dept, fun=length)
Or using qdap
library(qdap)
cbind(df, as.wfm(with(df, mtabulate(setNames(dept, id)))))
data
df <- structure(list(id = 1:10, empId = c("a", "a", "a", "b", "a",
"b", "c", "c", "b", "a"), dept = c("social", "Hist", "math",
"comp", "social", "comp", "math", "Hist", "math", "comp")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c(NA, -10L))

Try:
> cbind(dd[1:3], dcast(dd, dd$id~dd$dept, length)[-1])
Using dept as value column: use value.var to override.
id empId dept comp Hist math social
1 1 a social 0 0 0 1
2 2 a Hist 0 1 0 0
3 3 a math 0 0 1 0
4 4 b comp 1 0 0 0
5 5 a social 0 0 0 1
6 6 b comp 1 0 0 0
7 7 c math 0 0 1 0
8 8 c Hist 0 1 0 0
9 9 b math 0 0 1 0
10 10 a comp 1 0 0 0
data:
> dput(dd)
structure(list(id = 1:10, empId = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 3L, 3L, 2L, 1L), .Label = c("a", "b", "c"), class = "factor"),
dept = structure(c(4L, 2L, 3L, 1L, 4L, 1L, 3L, 2L, 3L, 1L
), .Label = c("comp", "Hist", "math", "social"), class = "factor")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))

Related

Reshaping long to wide R with categorical variables

I have a data frame that looks like the following, with year and ID identifiers, as well as many categorical variables (values denoted with capital letters below):
Year ID Var1 Var2 Var3 ...
1996 1 A A B
1996 1 A A C
1996 2 B A D
1998 2 C C A
2000 3 D D D
My goal is to reshape this into wide format by ID, but also giving counts for ID, year, and value. So, for example:
ID Var1_1996_A Var1_1996_B Var1_1996_C Var1_1996_D ...
1 2 0 0 0
2 0 1 0 0
3 0 0 0 0
And so on, for each variable. I'm relatively new to R and couldn't quite find a similar operation from existing posts (apologies if this is duplicate). Would anyone know what the best way to accomplish this would be? I have tried using tidyr::pivot_wider, but can only figure out how to append the years, but not create separate categories for each variable response
df <- df %>%
pivot_wider(names_from = Year,
values_from = c(Var1, Var2, Var3, Var4, Var5))
If anyone could offer some insight that would be greatly appreciated.
Get the data in long format first :
library(tidyr)
df %>%
pivot_longer(cols = starts_with('Var')) %>%
pivot_wider(names_from = c(name, Year, value), values_from = name,
values_fn = length, values_fill = 0)
# ID Var1_1996_A Var2_1996_A Var3_1996_B Var3_1996_C Var1_1996_B Var3_1996_D
# <int> <int> <int> <int> <int> <int> <int>
#1 1 2 2 1 1 0 0
#2 2 0 1 0 0 1 1
#3 3 0 0 0 0 0 0
# … with 6 more variables: Var1_1998_C <int>, Var2_1998_C <int>,
# Var3_1998_A <int>, Var1_2000_D <int>, Var2_2000_D <int>, Var3_2000_D <int>
data
df <- structure(list(Year = c(1996L, 1996L, 1996L, 1998L, 2000L), ID = c(1L,
1L, 2L, 2L, 3L), Var1 = c("A", "A", "B", "C", "D"), Var2 = c("A",
"A", "A", "C", "D"), Var3 = c("B", "C", "D", "A", "D")),
class = "data.frame", row.names = c(NA, -5L))
If you will be using base R:
xtabs(~ID+v, transform(cbind(df[1:2], stack(df, -(1:2))), v = paste(ind, Year, values, sep="_")))
v
ID Var1_1996_A Var1_1996_B Var1_1998_C Var1_2000_D Var2_1996_A Var2_1998_C Var2_2000_D Var3_1996_B Var3_1996_C Var3_1996_D Var3_1998_A Var3_2000_D
1 2 0 0 0 2 0 0 1 1 0 0 0
2 0 1 1 0 1 1 0 0 0 1 1 0
3 0 0 0 1 0 0 1 0 0 0 0 1
Of course to transform it to data.frame you could use: as.data.frame.matrix(...)
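As a rough sketch of that last step, the code below wraps the same kind of xtabs() result in as.data.frame.matrix(). The long table is built here with rep() and unlist() rather than stack(), purely as a choice for this illustration, and the names vars, long and wide are not from the answers above.

```r
df <- data.frame(
  Year = c(1996L, 1996L, 1996L, 1998L, 2000L),
  ID   = c(1L, 1L, 2L, 2L, 3L),
  Var1 = c("A", "A", "B", "C", "D"),
  Var2 = c("A", "A", "A", "C", "D"),
  Var3 = c("B", "C", "D", "A", "D"),
  stringsAsFactors = FALSE
)

# All columns except Year and ID are treated as categorical variables
vars <- names(df)[-(1:2)]

# One row per (ID, variable) pair, with a combined label such as "Var1_1996_A"
long <- data.frame(
  ID = rep(df$ID, times = length(vars)),
  v  = paste(rep(vars, each = nrow(df)),
             rep(df$Year, times = length(vars)),
             unlist(df[vars], use.names = FALSE),
             sep = "_")
)

# Count label occurrences per ID, then convert the table to a data frame
wide <- as.data.frame.matrix(xtabs(~ ID + v, data = long))
wide
```

The IDs end up as row names of wide; add them back as a column if you need them for a later merge.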

Subtracting multiple rows from the same row in R

I am looking to subtract multiple rows from the same row within a dataframe.
For example:
Group A B C
A 3 1 2
B 4 0 3
C 4 1 1
D 2 1 2
This is what I want it to look like:
Group A B C
B 1 -1 1
C 1 0 -1
D -1 0 0
So in other words:
Row B - Row A
Row C - Row A
Row D - Row A
Thank you!
Here's a dplyr solution:
library(dplyr)
df %>%
mutate(across(A:C, ~ . - .[1])) %>%
filter(Group != "A")
This gives us:
Group A B C
1: B 1 -1 1
2: C 1 0 -1
3: D -1 0 0
Here's an approach with base R:
data[-1] <- do.call(rbind,
apply(data[-1],1,function(x) x - data[1,-1])
)
data[-1,]
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
Data:
data <- structure(list(Group = c("A", "B", "C", "D"), A = c(3L, 4L, 4L,
2L), B = c(1L, 0L, 1L, 1L), C = c(2L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
We could also replicate the first row and subtract it from the rest
cbind(data[-1, 1, drop = FALSE], data[-1, -1] - data[1, -1][col(data[-1, -1])])
-output
# Group A B C
#2 B 1 -1 1
#3 C 1 0 -1
#4 D -1 0 0
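Another base R possibility, offered only as a sketch, is sweep(), which subtracts a vector from every row of a matrix. It assumes all non-Group columns are numeric; the names m and res are illustrative.

```r
data <- data.frame(
  Group = c("A", "B", "C", "D"),
  A = c(3L, 4L, 4L, 2L),
  B = c(1L, 0L, 1L, 1L),
  C = c(2L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Numeric part of every row except the first, as a matrix
m <- as.matrix(data[-1, -1])

# Subtract the first row's values column by column (MARGIN = 2)
res <- cbind(data[-1, "Group", drop = FALSE],
             sweep(m, 2, unlist(data[1, -1])))
res
```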

Reshaping data in R to a singular matrix

It may be a little difficult to explain exactly what I want, but I will try my best.
Say here is my data in R
R1 R2 R3 R4
a  b  a  a
b  d  c  b
         e
I want to reshape the data frame so that it has the data in a kind of singular matrix form, like this
a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1
I assume this is straightforward, but my limited knowledge of R is making it a hassle for me.
Thanks for your time
What about this?
un <- sort(unique(c(as.matrix(df))))
res <- apply(df, 2, function(x) un %in% x)
rownames(res) <- un
res[] <- as.numeric(res)
t(res)
a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1
The following uses the plyr library's ldply function which is for transforming a list with the result being a data.frame.
library(plyr)
data_as_list = list(R1=c('a', 'b'), R2=c('b', 'd'), R3=c('a', 'c'), R4=c('a', 'b', 'e'))
# one row per list element: is each of the letters a-e present in it?
result <- ldply(data_as_list, function(item) {
sapply(letters[1:5], function(letter) letter %in% item)*1})
Given a list of character vectors, we generate a row of the resulting data.frame from each item in the list by asking whether the first 5 letters (a-e) appear in the vector (item). Multiplying by 1 is a hack to convert a boolean vector to a 1-or-0 integer vector, if that's really what you want.
Results:
.id a b c d e
1 R1 1 1 0 0 0
2 R2 0 1 0 1 0
3 R3 1 0 1 0 0
4 R4 1 1 0 0 1
To fix up the row names:
rownames(result) <- result$.id
result <- result[, -which(colnames(result)=='.id')]
Now you have:
a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1
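The same membership test can probably be written without plyr. The following is a minimal base R sketch; it takes the set of values from the data instead of hard-coding letters[1:5], which is my assumption about what is wanted, and the names vals and res are illustrative.

```r
data_as_list <- list(R1 = c("a", "b"), R2 = c("b", "d"),
                     R3 = c("a", "c"), R4 = c("a", "b", "e"))

# Every distinct value that occurs anywhere in the list
vals <- sort(unique(unlist(data_as_list)))

# For each list element, flag which values it contains (1) or not (0);
# vapply returns one column per element, so transpose to get one row each
res <- t(vapply(data_as_list,
                function(item) as.integer(vals %in% item),
                integer(length(vals))))
colnames(res) <- vals
res
```

Here res is an integer matrix with R1 to R4 as row names, so no separate step is needed to fix up an .id column.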
Base R solution:
data_as_list = list(R1=c('a', 'b'), R2=c('b', 'd'), R3=c('a', 'c'), R4=c('a', 'b', 'e'))
stack(data_as_list)
#-----------
values ind
1 a R1
2 b R1
3 b R2
4 d R2
5 a R3
6 c R3
7 a R4
8 b R4
9 e R4
#---------
xtabs( ~ values+ind, data=stack(data_as_list) )
#-----------
ind
values R1 R2 R3 R4
a 1 0 1 1
b 1 1 0 1
c 0 0 1 0
d 0 1 0 0
e 0 0 0 1
xtabs( ~ ind+values, data=stack(data_as_list) )
#----------
values
ind a b c d e
R1 1 1 0 0 0
R2 0 1 0 1 0
R3 1 0 1 0 0
R4 1 1 0 0 1
Another approach is to use mtabulate from the "qdapTools" package. This will work for either a data.frame or a list... which should make sense, of course :-)
library(qdapTools)
x <- mtabulate(df)
x[] <- as.numeric(x > 0)
x
# V1 a b d c e
# R1 1 1 1 0 0 0
# R2 0 0 1 1 0 0
# R3 1 1 0 0 1 0
# R4 0 1 1 0 0 1
Since there are two "d" values in "R2", we use the as.numeric(x > 0) to convert to just ones and zeroes. You can drop the first column, which has counted the blanks.
I've used the sample data provided by @barerd:
df <- structure(list(R1 = structure(c(2L, 3L, 1L), .Label = c("", "a",
"b"), class = "factor"), R2 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), R3 = structure(c(2L, 3L, 1L), .Label = c("",
"a", "c"), class = "factor"), R4 = structure(1:3, .Label = c("a",
"b", "e"), class = "factor")), .Names = c("R1", "R2", "R3", "R4"
), row.names = c(NA, -3L), class = "data.frame")
Here is a possibility. This could be improved to scale better.
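# assumes ae <- c("a", "b", "c", "d", "e") and that R1..R4 hold each column's values as character vectors
matrix(as.numeric(rbind(ae %in% R1,
ae %in% R2,
ae %in% R3,
ae %in% R4)), 4, 5)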
x1<-as.character(grep("[a-z]",unique(unlist(df)),value=TRUE)) #df is data
x2<-data.frame(do.call(rbind,lapply(1:ncol(df),function(i){ifelse(x1 %in% df[,i],1,0)})))
colnames(x2)<-x1
row.names(x2)<-names(df)
x2
a b d c e
R1 1 1 0 0 0
R2 0 1 1 0 0
R3 1 0 0 1 0
R4 1 1 0 0 1
First of all, I suppose this is data from a csv file or a table, which you can read into R with read.table() or read.csv().
And you should put it with dput() like:
structure(list(R1 = structure(c(2L, 3L, 1L), .Label = c("", "a",
"b"), class = "factor"), R2 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), R3 = structure(c(2L, 3L, 1L), .Label = c("",
"a", "c"), class = "factor"), R4 = structure(1:3, .Label = c("a",
"b", "e"), class = "factor")), .Names = c("R1", "R2", "R3", "R4"
), row.names = c(NA, -3L), class = "data.frame")
so that we can put it into R easily.
You can reshape your data with the "reshape" library. There are many documents for reshaping data in R, including the help page, but basically you can transpose() your data, so that columns become rows. You can melt() it, so that each row becomes a unique id-variable combination like:
X1 X2 value
1 R1 1 a
2 R2 1 d
3 R3 1 a
4 R4 1 a
5 R1 2 b
6 R2 2 d
7 R3 2 c
8 R4 2 b
9 R1 3
10 R2 3 b
11 R3 3
12 R4 3 e
and then, you can cast(data, formula, function) the melted data into any shape. Since you wanted to see the distribution of the values according to R* stuff, I used the following formula:
t(cast(melt(t(t), id=c("a", "b", "c", "d", "e")), value~X1, ))[, c(2:6)]
and got:
a b c d e
R1 1 1 0 0 0
R2 0 1 0 2 0
R3 1 0 1 0 0
R4 1 1 0 0 1

how to merge AND aggregate 3+ data frames of different lengths and colnames

I have three data frames of similar structure but with one different column name and different number of rows.
> a
ID count alpha
1 207 1 1
2 351 1 1
3 372 1 1
4 595 4 1
5 596 1 1
6 652 1 1
> b
ID count beta
1 207 1 1
2 351 1 1
3 372 1 1
4 1024 6 1
> c
ID count zeta
1 207 4 1
2 351 1 1
3 372 1 1
4 595 2 1
I need to make a new data frame with all columns from both (id, count, alpha, beta), while outputting the sum for count. If an ID only shows up in one data frame, it should output 0 in the corresponding column. The desired output is as follows:
> abc
ID count alpha beta zeta
1 207 6 1 1 1
2 351 3 1 1 1
3 372 3 1 1 1
4 595 6 1 0 1
5 596 1 1 0 0
6 652 1 1 0 0
7 1024 6 0 1 0
I tried merge() on a and b and got this output:
> merge(a, b, by=intersect(names(a),names(b)), all=TRUE, sort=TRUE)
id count alpha beta
1 207 1 1 1
2 351 1 1 1
3 372 1 1 1
4 595 4 1 NA
5 596 1 1 NA
6 652 1 1 NA
7 1024 6 NA 1
I'm OK with 0's being NA's but I have two major problems with this output:
(1) the count columns are not summed
(2) merge() works with just 2 data frames and I actually have a lot more (like 10)
Any advice is welcome.
Here's how I would approach this:
Create a list of the relevant data.frames (as easy as putting them all in list()).
Use rbindlist (or one of the other enhanced rbind function that lets you bind datasets together by rows even if the columns are different--see "plyr" and "dplyr" for other common alternatives to rbindlist).
Here, I've used rbindlist from "data.table".
library(data.table)
rbindlist(list(a, b, c), use.names = TRUE, fill = TRUE)[
, lapply(.SD, sum, na.rm = TRUE), by = ID]
# ID count alpha beta zeta
# 1: 207 6 1 1 1
# 2: 351 3 1 1 1
# 3: 372 3 1 1 1
# 4: 595 6 1 0 1
# 5: 596 1 1 0 0
# 6: 652 1 1 0 0
# 7: 1024 6 0 1 0
I'm not sure if this is exactly how you want to deal with the "alpha", "beta", ... columns. I've just summed everything.
Sample data used in this answer:
a <- structure(list(
ID = c(207L, 351L, 372L, 595L, 596L, 652L),
count = c(1L, 1L, 1L, 4L, 1L, 1L),
alpha = c(1L, 1L, 1L, 1L, 1L, 1L)),
.Names = c("ID", "count", "alpha"),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
b <- structure(list(
ID = c(207L, 351L, 372L, 1024L),
count = c(1L, 1L, 1L, 6L), beta = c(1L, 1L, 1L, 1L)),
.Names = c("ID", "count", "beta"),
class = "data.frame", row.names = c("1", "2", "3", "4"))
c <- structure(list(
ID = c(207L, 351L, 372L, 595L),
count = c(4L, 1L, 1L, 2L), zeta = c(1L, 1L, 1L, 1L)),
.Names = c("ID", "count", "zeta"),
class = "data.frame", row.names = c("1", "2", "3", "4"))
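If you would rather avoid extra packages, the same bind-then-sum idea can be sketched in base R: give every data frame the full set of columns (filling the missing ones with 0), stack them with rbind, and sum by ID with aggregate(). This is only an illustration under that assumption; dfs, all_cols, filled and abc are names I made up.

```r
dfs <- list(
  a = data.frame(ID = c(207L, 351L, 372L, 595L, 596L, 652L),
                 count = c(1L, 1L, 1L, 4L, 1L, 1L), alpha = 1L),
  b = data.frame(ID = c(207L, 351L, 372L, 1024L),
                 count = c(1L, 1L, 1L, 6L), beta = 1L),
  c = data.frame(ID = c(207L, 351L, 372L, 595L),
                 count = c(4L, 1L, 1L, 2L), zeta = 1L)
)

# Union of all column names across the data frames
all_cols <- unique(unlist(lapply(dfs, names)))

# Add each missing column as 0 and put the columns in a common order
filled <- lapply(dfs, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- 0L
  d[all_cols]
})

# Stack everything, then sum every column within each ID
abc <- aggregate(. ~ ID, data = do.call(rbind, filled), FUN = sum)
abc
```

Because the list can hold any number of data frames, this should extend to the 10 or so mentioned in the question without chaining merge() calls.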
This can be done with dplyr in stages.
Given data:
dfA <- data.frame(c(207, 351, 372, 595, 596, 652), c(1, 1, 1, 4, 1, 1), rep(1, 6))
names(dfA) <- c('ID', 'count', 'alpha')
dfB <- data.frame(c(207, 351, 372, 1024), c(1, 1, 1, 6), rep(1, 4))
names(dfB) <- c('ID', 'count', 'beta')
dfC <- data.frame(c(207, 351, 372, 595), c(4, 1, 1, 2), rep(1, 4))
names(dfC) <- c('ID', 'count', 'zeta')
The following, while somewhat ugly, will work:
library(dplyr)
dfT <- bind_rows(dfA, dfB, dfC)
df_1 <- dfT %>% group_by(ID) %>% summarise(sum(count))
df_F <- data.frame(df_1, as.numeric(df_1$ID %in% dfA$ID), as.numeric(df_1$ID %in% dfB$ID), as.numeric(df_1$ID %in% dfC$ID))
names(df_F) <- c("ID", "count", "alpha", "beta", "zeta")
> df_F
ID count alpha beta zeta
1 207 6 1 1 1
2 351 3 1 1 1
3 372 3 1 1 1
4 595 6 1 0 1
5 596 1 1 0 0
6 652 1 1 0 0
7 1024 6 0 1 0

reshape data frame in R

I have a data frame that I need to reshape, transforming repeated values in a single column into a single row with several data columns. I know this should be simple but I can't figure out how to do this, and which of the many reshape/cast functions available I need to use.
Part of my data looks like this:
Source ID info
1 In 842701 1
2 Out 842701 1
3 In 21846591 2
4 Out 21846591 2
5 In 22181760 3
6 In 39338740 4
7 Out 9428 5
I want to make it look like this:
ID In Out info
1 842701 1 1 1
2 21846591 1 1 2
3 22181760 1 0 3
4 39338740 1 0 4
5 9428 0 1 5
and so on, while preserving all the remaining columns (which are identical for a given entry).
I would really appreciate some help. TIA.
Here is a way using reshape2
library(reshape2)
res <- dcast(transform(df, indx=1, ID=factor(ID, levels=unique(ID))),
ID~Source, value.var="indx", fill=0)
res
# ID In Out
#1 842701 1 1
#2 21846591 1 1
#3 22181760 1 0
#4 39338740 1 0
#5 9428 0 1
Or
res1 <- as.data.frame.matrix(table(transform(df,
ID=factor(ID, levels=unique(ID)))[,2:1]))
Update
dcast(transform(df1, indx=1, ID=factor(ID, levels=unique(ID))),
...~Source, value.var="indx", fill=0)
# ID info In Out
#1 842701 1 1 1
#2 21846591 2 1 1
#3 22181760 3 1 0
#4 39338740 4 1 0
#5 9428 5 0 1
You could also use reshape from base R
res2 <- reshape(transform(df1, indx=1), idvar=c("ID", "info"),
timevar="Source", direction="wide")
res2[,3:4][is.na(res2)[,3:4]] <- 0
res2
# ID info indx.In indx.Out
#1 842701 1 1 1
#3 21846591 2 1 1
#5 22181760 3 1 0
#6 39338740 4 1 0
#7 9428 5 0 1
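A further base R sketch, along the lines of the table() approach above but keeping the info column: tabulate Source per ID and bind the counts to the unique ID/info pairs. It assumes info is constant within each ID, as stated in the question; ids, tab and res3 are illustrative names.

```r
df1 <- data.frame(
  Source = c("In", "Out", "In", "Out", "In", "In", "Out"),
  ID     = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
             39338740L, 9428L),
  info   = c(1L, 1L, 2L, 2L, 3L, 4L, 5L),
  stringsAsFactors = FALSE
)

# One row per ID with its (constant) extra columns, in order of appearance
ids <- unique(df1[c("ID", "info")])

# Count In/Out per ID, keeping the same ID order as in ids
tab <- as.data.frame.matrix(
  table(factor(df1$ID, levels = ids$ID), df1$Source)
)

# Combine and arrange the columns as in the desired output
res3 <- data.frame(ids, tab, row.names = NULL)[c("ID", "In", "Out", "info")]
res3
```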
data
df <- structure(list(Source = c("In", "Out", "In", "Out", "In", "In",
"Out"), ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
39338740L, 9428L)), .Names = c("Source", "ID"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
df1 <- structure(list(Source = c("In", "Out", "In", "Out", "In", "In",
"Out"), ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
39338740L, 9428L), info = c(1L, 1L, 2L, 2L, 3L, 4L, 5L)), .Names = c("Source",
"ID", "info"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))
