I have a data frame that I need to reshape, transforming repeated values in a single column into a single row with several data columns. I know this should be simple, but I can't figure out how to do it or which of the many reshape/cast functions I need to use.
Part of my data looks like this:
Source ID info
1 In 842701 1
2 Out 842701 1
3 In 21846591 2
4 Out 21846591 2
5 In 22181760 3
6 In 39338740 4
7 Out 9428 5
I want to make it look like this:
ID In Out info
1 842701 1 1 1
2 21846591 1 1 2
3 22181760 1 0 3
4 39338740 1 0 4
5 9428 0 1 5
and so on, while preserving all the remaining columns (which are identical for a given entry).
I would really appreciate some help. TIA.
Here is a way using reshape2
library(reshape2)
res <- dcast(transform(df, indx=1, ID=factor(ID, levels=unique(ID))),
ID~Source, value.var="indx", fill=0)
res
# ID In Out
#1 842701 1 1
#2 21846591 1 1
#3 22181760 1 0
#4 39338740 1 0
#5 9428 0 1
Or
res1 <- as.data.frame.matrix(table(transform(df,
ID=factor(ID, levels=unique(ID)))[,2:1]))
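To make the table() variant concrete, here is a minimal self-contained sketch using only base R and the same values as df in the data section. It is an illustration rather than a drop-in replacement: the IDs end up as row names, not as a column.

```r
df <- data.frame(
  Source = c("In", "Out", "In", "Out", "In", "In", "Out"),
  ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L, 39338740L, 9428L),
  stringsAsFactors = FALSE
)

# Fix the level order so rows come out in order of first appearance
df$ID <- factor(df$ID, levels = unique(df$ID))

# Cross-tabulate ID (rows) against Source (columns), then convert the
# contingency table to a plain data frame of counts
res1 <- as.data.frame.matrix(table(df[, c("ID", "Source")]))
res1
```

If the ID is needed as a regular column, something like cbind(ID = rownames(res1), res1) should restore it.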
Update
dcast(transform(df1, indx=1, ID=factor(ID, levels=unique(ID))),
...~Source, value.var="indx", fill=0)
# ID info In Out
#1 842701 1 1 1
#2 21846591 2 1 1
#3 22181760 3 1 0
#4 39338740 4 1 0
#5 9428 5 0 1
You could also use reshape from base R
res2 <- reshape(transform(df1, indx=1), idvar=c("ID", "info"),
timevar="Source", direction="wide")
res2[,3:4][is.na(res2)[,3:4]] <- 0
res2
# ID info indx.In indx.Out
#1 842701 1 1 1
#3 21846591 2 1 1
#5 22181760 3 1 0
#6 39338740 4 1 0
#7 9428 5 0 1
data
df <- structure(list(Source = c("In", "Out", "In", "Out", "In", "In",
"Out"), ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
39338740L, 9428L)), .Names = c("Source", "ID"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
df1 <- structure(list(Source = c("In", "Out", "In", "Out", "In", "In",
"Out"), ID = c(842701L, 842701L, 21846591L, 21846591L, 22181760L,
39338740L, 9428L), info = c(1L, 1L, 2L, 2L, 3L, 4L, 5L)), .Names = c("Source",
"ID", "info"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))
I have a dataframe that looks like the following:
ID V1 V2 V3 V4 V5
1 a 6 3 5 3
2 c 4 1 2 1
3 g 8 2 4 2
4 h 7 9 8 1
5 a 4 6 2 1
6 b 4 2 1 2
7 j 8 7 1 4
I need to create a new dummy variable and add it to this dataframe as column "V6". I need to do it based on a matrix from an external spreadsheet such as the following:
V1 1 2 3 4 5 6 7 8 9
a 1 1 1 1 1
b 1 1 1 1 1 1 1
c 1
d 1 1 1 1 1 1 1 1 1
g 1 1
h 1 1
i 1 1 1
j
k 1 1 1 1 1
In the above matrix, the V1 column is the value of the V1 variable in the original dataframe, and the other columns correspond with possible values of the V5 variable. All the empty spaces are blank in the spreadsheet. I need the new dummy variable, V6 to represent 1 if the unit is a 1 on the matrix based on the intersection of values. The result would therefore be the following:
ID V1 V2 V3 V4 V5 V6
1 a 6 3 5 3 0
2 c 4 1 2 1 0
3 g 8 2 4 2 1
4 h 7 9 8 1 0
5 a 4 6 2 1 1
6 b 4 2 1 2 0
7 j 8 7 1 4 0
ID 1 is a 0 for the V6 variable because, in the matrix, a and the value 3 intersect at a blank (or 0): its V1 is a and its V5 is 3. Conversely, the third row generates a 1, because its V1 is g and its V5 value is 2. That intersection on the matrix, g-2, is a 1, so V6 for that combination is a "hit", or a 1 in the dummy variable.
I recognize this is an odd method of dummy variable creation, but how can one use an externally created spreadsheet like this to create dummy variables based on the intersection of values most efficiently? What would be a flexible way to code this, so that it could be adapted depending on if the variables are character or numeric?
I think it's best to approach this by pivoting/reshaping df2 (the 1s and blanks), and joining it on df1 (original data).
Note: it isn't abundantly clear if your df2 has empty strings or NA values. If the latter, then replace the nzchar(V6) with !is.na(V6) or !V6 %in% c(NA, "") (for both possibilities).
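The difference between the three filters matters because nzchar() treats NA as non-empty by default. A tiny sketch on a made-up vector (v is purely illustrative, not part of the question's data):

```r
v <- c("1", "", NA)

keep_nzchar <- nzchar(v)           # NA is reported as non-empty by default
keep_notna  <- !is.na(v)           # the empty string is kept
keep_both   <- !v %in% c(NA, "")   # drops both NA and ""

rbind(keep_nzchar, keep_notna, keep_both)
```

Only the last form drops both kinds of "blank", which is why it is the safer choice when the import step is unknown.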
base R (plus reshape2::melt for the reshaping; the |> pipe needs R >= 4.1)
out <- reshape2::melt(df2, "V1", variable.name = "V5", value.name = "V6") |>
subset(nzchar(V6)) |>
merge(df1, by = c("V1", "V5"), all.y = TRUE) |>
transform(V6 = +(!is.na(V6)))
out
# V1 V5 V6 ID V2 V3 V4
# 1 a 1 1 5 4 6 2
# 2 a 3 0 1 6 3 5
# 3 b 2 0 6 4 2 1
# 4 c 1 0 2 4 1 2
# 5 g 2 1 3 8 2 4
# 6 h 1 0 4 7 9 8
# 7 j 4 0 7 8 7 1
The rows and columns are out of order, but we can restore them fairly easily:
out <- out[order(out$ID), c("ID", sort(setdiff(names(out), "ID")))]
out
# ID V1 V2 V3 V4 V5 V6
# 2 1 a 6 3 5 3 0
# 4 2 c 4 1 2 1 0
# 5 3 g 8 2 4 2 1
# 6 4 h 7 9 8 1 0
# 1 5 a 4 6 2 1 1
# 3 6 b 4 2 1 2 0
# 7 7 j 8 7 1 4 0
dplyr/tidyr
library(dplyr)
library(tidyr) # pivot_longer
df2 %>%
pivot_longer(-V1, names_to = "V5", values_to = "V6") %>%
filter(nzchar(V6)) %>%
# dplyr requires the join columns to be the same class, but the
# column names from `df2` are still character, as all column names are
mutate(V5 = as.integer(V5)) %>%
left_join(df1, ., by = c("V1", "V5")) %>%
mutate(V6 = +(!is.na(V6)))
# ID V1 V2 V3 V4 V5 V6
# 1 1 a 6 3 5 3 0
# 2 2 c 4 1 2 1 0
# 3 3 g 8 2 4 2 1
# 4 4 h 7 9 8 1 0
# 5 5 a 4 6 2 1 1
# 6 6 b 4 2 1 2 0
# 7 7 j 8 7 1 4 0
Data
df1 <- structure(list(ID = 1:7, V1 = c("a", "c", "g", "h", "a", "b", "j"), V2 = c(6L, 4L, 8L, 7L, 4L, 4L, 8L), V3 = c(3L, 1L, 2L, 9L, 6L, 2L, 7L), V4 = c(5L, 2L, 4L, 8L, 2L, 1L, 1L), V5 = c(3L, 1L, 2L, 1L, 1L, 2L, 4L)), class = "data.frame", row.names = c(NA, -7L))
df2 <- structure(list(V1 = c("a", "b", "c", "d", "g", "h", "i", "j", "k"), "1" = c("1", "1", "", "1", "1", "", "", "", "1"), "2" = c("1", "", "", "1", "1", "1", "", "", "1"), "3" = c("", "", "", "1", "", "", "", "", "1"), "4" = c("1", "1", "", "1", "", "", "", "", "1"), "5" = c("1", "1", "", "1", "", "", "1", "", "1"), "6" = c("1", "1", "", "1", "", "", "1", "", ""), "7" = c("", "1", "", "1", "", "", "", "", ""), "8" = c("", "1", "", "1", "", "", "", "", ""), "9" = c("", "1", "1", "1", "", "1", "1", "", "")), row.names = c(NA, -9L), class = "data.frame")
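As an alternative to reshaping and joining, the lookup can also be sketched with base R matrix indexing: turn the spreadsheet into a logical matrix and index it with (row, column) pairs. This is only a sketch under assumptions. To keep it short it uses a trimmed copy of df1 (ID, V1, V5) and only the first four value columns of df2, and the name hit is made up for the illustration.

```r
df1 <- data.frame(ID = 1:7,
                  V1 = c("a", "c", "g", "h", "a", "b", "j"),
                  V5 = c(3L, 1L, 2L, 1L, 1L, 2L, 4L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1  = c("a", "b", "c", "d", "g", "h", "i", "j", "k"),
                  `1` = c("1", "1", "", "1", "1", "", "", "", "1"),
                  `2` = c("1", "", "", "1", "1", "1", "", "", "1"),
                  `3` = c("", "", "", "1", "", "", "", "", "1"),
                  `4` = c("1", "1", "", "1", "", "", "", "", "1"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Logical lookup matrix: TRUE where the spreadsheet cell is neither NA nor ""
hit <- as.matrix(df2[-1])
hit <- !is.na(hit) & hit != ""
rownames(hit) <- df2$V1

# One (row, column) index pair per row of df1; both keys are matched as
# character, so it does not matter whether V5 is stored as numeric or character
idx <- cbind(match(as.character(df1$V1), rownames(hit)),
             match(as.character(df1$V5), colnames(hit)))

df1$V6 <- as.integer(hit[idx])
df1$V6[is.na(df1$V6)] <- 0L   # keys missing from the lookup count as 0
df1
```

Because both keys go through as.character() before matching, the same code should work whether the variables are character or numeric.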
I have the following data.frames(sample):
>df1
number ACTION
1 1 this
2 2 that
3 3 theOther
4 4 another
>df2
id VALUE
1 1 3
2 2 4
3 3 2
4 4 1
4 5 4
4 6 2
4 7 3
. . .
. . .
I would like df2 to become like the following:
>df2
id VALUE
1 1 theOther
2 2 another
3 3 that
4 4 this
4 5 another
4 6 that
4 7 theOther
. . .
. . .
It can be done manually by using the following for each value:
df2[df2==1] <- 'this'
df2[df2==2] <- 'that'
.
.
and so on, but is there a way to do it without handling each value manually?
Try
df2$VALUE <- setNames(df1$ACTION, df1$number)[as.character(df2$VALUE)]
df2
# id VALUE
#1 1 theOther
#2 2 another
#3 3 that
#4 4 this
#5 5 another
#6 6 that
#7 7 theOther
Or use match
df2$VALUE <- df1$ACTION[match(df2$VALUE, df1$number)]
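Both lookups behave the same way when a value has no entry in df1: the result is NA for that element. A small sketch, assuming a hypothetical extra value 9 that is not in df1$number:

```r
df1 <- data.frame(number = 1:4,
                  ACTION = c("this", "that", "theOther", "another"),
                  stringsAsFactors = FALSE)
values <- c(3L, 4L, 2L, 1L, 9L)   # 9 is hypothetical and has no match in df1

# Named-vector lookup: index by the value converted to a name
by_name <- unname(setNames(df1$ACTION, df1$number)[as.character(values)])

# Positional lookup: match() returns the row of df1 holding each value
by_match <- df1$ACTION[match(values, df1$number)]

by_name
by_match
```

If unmatched values should be kept rather than turned into NA, one option is ifelse(is.na(by_match), values, by_match), which yields a character vector.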
data
df1 <- structure(list(number = 1:4, ACTION = c("this", "that",
"theOther",
"another")), .Names = c("number", "ACTION"), class = "data.frame",
row.names = c("1", "2", "3", "4"))
df2 <- structure(list(id = 1:7, VALUE = c(3L, 4L, 2L, 1L, 4L, 2L, 3L
)), .Names = c("id", "VALUE"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
You could do:
library(qdapTools)
df2$VALUE <- lookup(terms = df2$VALUE, key.match = df1)
Note that for this to work, the columns of df1 must be in the proper order (match key first, then the reassignment column). From ?lookup
key.match
Takes one of the following: (1) a two column data.frame of a match key
and reassignment column, (2) a named list of vectors (Note: if
data.frame or named list supplied no key reassign needed) or (3) a
single vector match key.
Which gives:
# id VALUE
#1 1 theOther
#2 2 another
#3 3 that
#4 4 this
#5 5 another
#6 6 that
#7 7 theOther
After running kCliques in RBGL, I have a list comprised of cliques and their members.
I wish to construct a member-by-clique matrix from the list object created by kCliques.
As an example:
library(graph) # provides fromGXL
library(RBGL) # provides kCliques
con <- file(system.file("XML/snacliqueex.gxl",package="RBGL"))
coex <- fromGXL(con)
close(con)
kcl <- kCliques(coex)
which results in
kcl<-structure(list(`1-cliques` = list(c("1", "2", "3"), c("2", "4"),
c("3", "5"), c("4", "6"), c("5", "6")), `2-cliques` = list(
c("1", "2", "3", "4", "5"), c("2", "3", "4", "5", "6")),
`3-cliques` = list(c("1", "2", "3", "4", "5", "6"))),
.Names = c("1-cliques", "2-cliques", "3-cliques"))
kcl is a list where elements are character vectors indicating clique members.
I wish to construct a member-by-clique matrix where cell i,j indicates whether node i is a member of clique j.
Here are some transformations that should work
# remove one level of nesting
x <- do.call("c", kcl)
# assign a number to each clique
xx <- do.call("rbind", Map(function(x, y) data.frame(x, y), x, seq_along(x)))
# track participation
xtabs(~x+y, xx)
which gives
y
x 1 2 3 4 5 6 7 8
1 1 0 0 0 0 1 0 1
2 1 1 0 0 0 1 1 1
3 1 0 1 0 0 1 1 1
4 0 1 0 1 0 1 1 1
5 0 0 1 0 1 1 1 1
6 0 0 0 1 1 0 1 1
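The same member-by-clique matrix can also be sketched without building an intermediate data frame, by testing each node against each clique directly. This is a base R sketch that starts from the kcl list given in the question; the names cliques, nodes, and m are made up for the illustration.

```r
kcl <- list(
  `1-cliques` = list(c("1", "2", "3"), c("2", "4"), c("3", "5"),
                     c("4", "6"), c("5", "6")),
  `2-cliques` = list(c("1", "2", "3", "4", "5"), c("2", "3", "4", "5", "6")),
  `3-cliques` = list(c("1", "2", "3", "4", "5", "6"))
)

# Remove one level of nesting: a flat list with one element per clique
cliques <- unlist(kcl, recursive = FALSE)

# All node labels that appear in any clique
nodes <- sort(unique(unlist(cliques)))

# One column per clique, one row per node; 1 marks membership
m <- sapply(cliques, function(cl) as.integer(nodes %in% cl))
rownames(m) <- nodes
m
```

The column names here come from unlist() (for example 1-cliques1), so they keep track of which k each clique belongs to.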
I have three data frames of similar structure but with one different column name and different number of rows.
> a
ID count alpha
1 207 1 1
2 351 1 1
3 372 1 1
4 595 4 1
5 596 1 1
6 652 1 1
> b
ID count beta
1 207 1 1
2 351 1 1
3 372 1 1
4 1024 6 1
> c
ID count zeta
1 207 4 1
2 351 1 1
3 372 1 1
4 595 2 1
I need to make a new data frame with all columns from all three (ID, count, alpha, beta, zeta), while outputting the sum for count. If an ID is missing from one of the data frames, the corresponding column should be 0. The desired output is as follows:
> abc
ID count alpha beta zeta
1 207 6 1 1 1
2 351 3 1 1 1
3 372 3 1 1 1
4 595 6 1 0 1
5 596 1 1 0 0
6 652 1 1 0 0
7 1024 6 0 1 0
I tried merge() on a and b and got this output:
> merge(a, b, by=intersect(names(a),names(b)), all=TRUE, sort=TRUE)
ID count alpha beta
1 207 1 1 1
2 351 1 1 1
3 372 1 1 1
4 595 4 1 NA
5 596 1 1 NA
6 652 1 1 NA
7 1024 6 NA 1
I'm OK with 0's being NA's but I have two major problems with this output:
(1) the count columns are not summed
(2) merge() works with just 2 data frames and I actually have a lot more (like 10)
Any advice is welcome.
Here's how I would approach this:
Create a list of the relevant data.frames (as easy as putting them all in list()).
Use rbindlist (or one of the other enhanced rbind functions that let you bind datasets together by rows even if the columns are different; see "plyr" and "dplyr" for other common alternatives to rbindlist).
Here, I've used rbindlist from "data.table".
library(data.table)
rbindlist(list(a, b, c), use.names = TRUE, fill = TRUE)[
, lapply(.SD, sum, na.rm = TRUE), by = ID]
# ID count alpha beta zeta
# 1: 207 6 1 1 1
# 2: 351 3 1 1 1
# 3: 372 3 1 1 1
# 4: 595 6 1 0 1
# 5: 596 1 1 0 0
# 6: 652 1 1 0 0
# 7: 1024 6 0 1 0
I'm not sure if this is exactly how you want to deal with the "alpha", "beta", ... columns. I've just summed everything.
Sample data used in this answer:
a <- structure(list(
ID = c(207L, 351L, 372L, 595L, 596L, 652L),
count = c(1L, 1L, 1L, 4L, 1L, 1L),
alpha = c(1L, 1L, 1L, 1L, 1L, 1L)),
.Names = c("ID", "count", "alpha"),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
b <- structure(list(
ID = c(207L, 351L, 372L, 1024L),
count = c(1L, 1L, 1L, 6L), beta = c(1L, 1L, 1L, 1L)),
.Names = c("ID", "count", "beta"),
class = "data.frame", row.names = c("1", "2", "3", "4"))
c <- structure(list(
ID = c(207L, 351L, 372L, 595L),
count = c(4L, 1L, 1L, 2L), zeta = c(1L, 1L, 1L, 1L)),
.Names = c("ID", "count", "zeta"),
class = "data.frame", row.names = c("1", "2", "3", "4"))
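For comparison, the same bind-then-sum idea can be sketched in base R: pad every data frame with the columns it lacks (filled with 0), stack them, and sum by ID. This is an illustration only. The third frame is called cc here to avoid shadowing c(), and frames, all_cols, and padded are made-up names.

```r
a  <- data.frame(ID = c(207L, 351L, 372L, 595L, 596L, 652L),
                 count = c(1L, 1L, 1L, 4L, 1L, 1L), alpha = 1L)
b  <- data.frame(ID = c(207L, 351L, 372L, 1024L),
                 count = c(1L, 1L, 1L, 6L), beta = 1L)
cc <- data.frame(ID = c(207L, 351L, 372L, 595L),
                 count = c(4L, 1L, 1L, 2L), zeta = 1L)

frames <- list(a, b, cc)   # works the same way for 10 or more frames
all_cols <- unique(unlist(lapply(frames, names)))

# Add each missing column as 0, then put the columns in a common order
padded <- lapply(frames, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- 0L
  d[all_cols]
})

# Stack the rows and sum every column within each ID
abc <- aggregate(. ~ ID, data = do.call(rbind, padded), FUN = sum)
abc
```

As with the rbindlist version, the indicator columns are summed, so an ID that appears twice in the same frame would get a value above 1.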
This can be done with dplyr in stages.
Given data:
dfA <- data.frame(c(207, 351, 372, 595, 596, 652), c(1, 1, 1, 4, 1, 1), rep(1, 6))
names(dfA) <- c('ID', 'count', 'alpha')
dfB <- data.frame(c(207, 351, 372, 1024), c(1, 1, 1, 6), rep(1, 4))
names(dfB) <- c('ID', 'count', 'beta')
dfC <- data.frame(c(207, 351, 372, 595), c(4, 1, 1, 2), rep(1, 4))
names(dfC) <- c('ID', 'count', 'zeta')
The following, while somewhat ugly, will work:
library(dplyr)
dfT <- bind_rows(dfA, dfB, dfC)
df_1 <- dfT %>% group_by(ID) %>% summarise(sum(count))
df_F <- data.frame(df_1, as.numeric(df_1$ID %in% dfA$ID), as.numeric(df_1$ID %in% dfB$ID), as.numeric(df_1$ID %in% dfC$ID))
names(df_F) <- c("ID", "count", "alpha", "beta", "zeta")
> df_F
ID count alpha beta zeta
1 207 6 1 1 1
2 351 3 1 1 1
3 372 3 1 1 1
4 595 6 1 0 1
5 596 1 1 0 0
6 652 1 1 0 0
7 1024 6 0 1 0
I need help converting my long data of dimension 1558810 x 84 to wide data of dimension 1558810 x 4784.
Let me explain in detail how and why. My raw data is as follows -
The data has three main columns -
id empId dept
1 a social
2 a Hist
3 a math
4 b comp
5 a social
6 b comp
7 c math
8 c Hist
9 b math
10 a comp
id is the unique key that tells which employee went to which department in a university on a day. I need this to be transformed as below.
id empId dept social Hist math comp
1 a social 1 0 0 0
2 a Hist 0 1 0 0
3 a math 0 0 1 0
4 b comp 0 0 0 1
5 a social 1 0 0 0
6 b comp 0 0 0 1
7 c math 0 0 1 0
8 c Hist 0 1 0 0
9 b math 0 0 1 0
10 a comp 0 0 0 1
I have two datasets, one with 49k rows and one with 1.55 million rows. For the smaller dataset, which had 1100 unique department values, I used dcast in the reshape2 package to get the desired dataset (thus, the transformed data would have 3+1100 columns and 49k rows). But when I use the same function on my larger dataset, which has 4700 unique department values, R crashes because of a memory issue. I tried various other alternatives like xtabs, reshape etc., but every time it failed with a memory error.
I have now resorted to a crude FOR loop for this purpose -
columns <- unique(ds$dept)
for(i in 1:length(unique(ds$dept)))
{
ds[,columns[i]] <- ifelse(ds$dept==columns[i],1,0)
}
But this is extremely slow and the code has been running for 10 hrs now. Is there any workaround for this that I am missing?
Any suggestions would be of great help!
You could try
df$dept <- factor(df$dept, levels=unique(df$dept))
res <- cbind(df, model.matrix(~ 0+dept, df))
colnames(res) <- gsub("dept(?=[A-Za-z])", "", colnames(res), perl=TRUE)
res
# id empId dept social Hist math comp
#1 1 a social 1 0 0 0
#2 2 a Hist 0 1 0 0
#3 3 a math 0 0 1 0
#4 4 b comp 0 0 0 1
#5 5 a social 1 0 0 0
#6 6 b comp 0 0 0 1
#7 7 c math 0 0 1 0
#8 8 c Hist 0 1 0 0
#9 9 b math 0 0 1 0
#10 10 a comp 0 0 0 1
Or you could try
cbind(df, as.data.frame.matrix(table(df[,c(1,3)])))
Or using data.table
library(data.table)
setDT(df)
dcast.data.table(df, id + empId + dept ~ dept, fun.aggregate = length, value.var = "dept")
Or using qdap
library(qdap)
cbind(df, as.wfm(with(df, mtabulate(setNames(dept, id)))))
data
df <- structure(list(id = 1:10, empId = c("a", "a", "a", "b", "a",
"b", "c", "c", "b", "a"), dept = c("social", "Hist", "math",
"comp", "social", "comp", "math", "Hist", "math", "comp")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c(NA, -10L))
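On the memory side, a dense 1558810 x 4700 indicator matrix stored as doubles would need roughly 58 GB, so any approach that materialises all the 0s is likely to struggle. One possible workaround, sketched here on the small sample data, is to keep the indicators in a sparse matrix from the Matrix package (shipped with R as a recommended package), which stores only the 1s. This assumes the downstream code can accept a sparse matrix instead of a data frame; f and ind are made-up names.

```r
library(Matrix)

df <- data.frame(id = 1:10,
                 empId = c("a", "a", "a", "b", "a", "b", "c", "c", "b", "a"),
                 dept = c("social", "Hist", "math", "comp", "social",
                          "comp", "math", "Hist", "math", "comp"),
                 stringsAsFactors = FALSE)

# Factor codes give the column position of each row's single 1
f <- factor(df$dept, levels = unique(df$dept))

# One stored entry per row: row i has a 1 in the column of its department
ind <- sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
                    dims = c(length(f), nlevels(f)),
                    dimnames = list(NULL, levels(f)))
ind
```

The original columns stay in df and the indicators in ind, with rows aligned by position, so storage grows with the number of rows rather than rows times departments.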
Try, with reshape2 loaded for dcast:
> cbind(dd[1:3], dcast(dd, dd$id~dd$dept, length)[-1])
Using dept as value column: use value.var to override.
id empId dept comp Hist math social
1 1 a social 0 0 0 1
2 2 a Hist 0 1 0 0
3 3 a math 0 0 1 0
4 4 b comp 1 0 0 0
5 5 a social 0 0 0 1
6 6 b comp 1 0 0 0
7 7 c math 0 0 1 0
8 8 c Hist 0 1 0 0
9 9 b math 0 0 1 0
10 10 a comp 1 0 0 0
data:
> dput(dd)
structure(list(id = 1:10, empId = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 3L, 3L, 2L, 1L), .Label = c("a", "b", "c"), class = "factor"),
dept = structure(c(4L, 2L, 3L, 1L, 4L, 1L, 3L, 2L, 3L, 1L
), .Label = c("comp", "Hist", "math", "social"), class = "factor")), .Names = c("id",
"empId", "dept"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10"))