Stacking multiple columns in R

I am trying to convert a data frame into long form in R.
The example data are from surveys conducted in 'id' grids over 9 days; the variable of interest was either detected ('1') or not detected ('0').
I want to reshape this data frame so that the number of surveys is reduced from 9 to 3,
with each survey period now containing 3 visits.
I am trying to do so by stacking three columns at a time, so that survey visits 'v1' to 'v9' (in the image below) are converted to v1, v2, v3, with an added column 'visit_no' that gives the visit number within the survey period.
The following link is an image of the data frame in its current form, and below it is the code to generate the data.
Code to generate data:
id <- c(240, 220, 160)
v1 <- rep(0, 9)
v2 <- c(rep(0, 3), 1, rep(0, 5))
v3 <- c(1, rep(0, 8))
v <- as.data.frame(rbind(v1, v2, v3))
survey <- cbind(id, v)
survey
This is the link to an image of the data frame I need:
Reference data-frame

One way is using reshape in base R:
reshape(survey, direction = "long", idvar = "id",
        varying = list(c("V1", "V4", "V7"), c("V2", "V5", "V8"), c("V3", "V6", "V9")),
        v.names = c("Visit1", "Visit2", "Visit3"), timevar = "visit_no")
id visit_no Visit1 Visit2 Visit3
240.1 240 1 0 0 0
220.1 220 1 0 0 0
160.1 160 1 1 0 0
240.2 240 2 0 0 0
220.2 220 2 1 0 0
160.2 160 2 0 0 0
240.3 240 3 0 0 0
220.3 220 3 0 0 0
160.3 160 3 0 0 0
If you want it sorted by id, then add arrange from dplyr:
%>% dplyr::arrange(id)
id visit_no Visit1 Visit2 Visit3
1 160 1 1 0 0
2 160 2 0 0 0
3 160 3 0 0 0
4 220 1 0 0 0
5 220 2 1 0 0
6 220 3 0 0 0
7 240 1 0 0 0
8 240 2 0 0 0
9 240 3 0 0 0
If your original variable names were in a consistent format, then the reshape command is even simpler because it will correctly guess the times from the names. For example,
names(survey)[2:10] <- paste0(names(survey)[2:10], ".", rep(1:3, 3))
head(survey)
id V1.1 V2.2 V3.3 V4.1 V5.2 V6.3 V7.1 V8.2 V9.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction = "long", idvar = "id",
        varying = 2:10, # can just give the indices now
        v.names = c("Visit1", "Visit2", "Visit3"), timevar = "visit_no") %>%
  arrange(id)
Although the times are in a consistent format, the original variable names are not, so R cannot guess the names for the long format (Visit1, Visit2, Visit3), and these need to be supplied in the v.names argument.
If they were in a consistent format, then the reshape is even simpler.
names(survey)[2:10] <- paste0("Visit", rep(1:3, each=3), ".", rep(1:3, 3))
head(survey)
id Visit1.1 Visit1.2 Visit1.3 Visit2.1 Visit2.2 Visit2.3 Visit3.1 Visit3.2 Visit3.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction = "long", varying = 2:10, timevar = "visit_no") %>%
  arrange(id)
The tidyr version would probably involve two reshapes: one to get everything into very long form, and another to get it back to a wider form (what I call the "one step back, two steps forward" method).
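For completeness, a sketch of that two-step tidyr route, rebuilding the survey object from the question (its columns come out as V1..V9); the helper column names k and visit below are my own, not from the original answer:

```r
library(dplyr)
library(tidyr)

# rebuild the question's example data (columns are named V1..V9 automatically)
id <- c(240, 220, 160)
survey <- cbind(id, as.data.frame(rbind(rep(0, 9),
                                        c(rep(0, 3), 1, rep(0, 5)),
                                        c(1, rep(0, 8)))))

long3 <- survey %>%
  pivot_longer(-id, names_to = "col", values_to = "value") %>%  # step 1: fully long
  mutate(k        = as.integer(sub("V", "", col)),
         visit_no = (k - 1) %/% 3 + 1,                 # survey period 1..3
         visit    = paste0("Visit", (k - 1) %% 3 + 1)  # visit within the period
  ) %>%
  select(id, visit_no, visit, value) %>%
  pivot_wider(names_from = visit, values_from = value) %>%      # step 2: back wider
  arrange(id)
long3
```

This reproduces the same 9-row result as the reshape call above, sorted by id.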

You can change the name of the columns based on the sequence that you want.
names(survey)[-1] <- paste(rep(paste0("visit", 1:3), each =3), 1:3, sep = "_")
names(survey)
#[1] "id" "visit1_1" "visit1_2" "visit1_3" "visit2_1" "visit2_2" "visit2_3"
# "visit3_1" "visit3_2" "visit3_3"
And then use pivot_longer from tidyr to get data in different columns.
tidyr::pivot_longer(survey, cols = -id,
                    names_to = c(".value", "visit_no"),
                    names_sep = "_") %>%
  type.convert(as.is = TRUE)
# A tibble: 9 x 5
# id visit_no visit1 visit2 visit3
# <int> <int> <int> <int> <int>
#1 240 1 0 0 0
#2 240 2 0 0 0
#3 240 3 0 0 0
#4 220 1 0 1 0
#5 220 2 0 0 0
#6 220 3 0 0 0
#7 160 1 1 0 0
#8 160 2 0 0 0
#9 160 3 0 0 0

Related

Permute labels in a dataframe but for pairs of observations

I'm not sure whether the title is clear, but I want to shuffle a column in a data frame, not for every individual row (which is very simple to do using sample()), but for pairs of observations from the same sample.
For instance, I have the following dataframe df1:
>df1
sampleID groupID A B C D E F
438 1 1 0 0 0 0 0
438 1 0 0 0 0 1 1
386 1 1 1 1 0 0 0
386 1 0 0 0 1 0 0
438 2 1 0 0 0 1 1
438 2 0 1 1 0 0 0
582 2 0 0 0 0 0 0
582 2 1 0 0 0 1 0
597 1 0 1 0 0 0 1
597 1 0 0 0 0 0 0
I want to randomly shuffle the labels here for groupID for each sample, not observation, so that the result looks like:
>df2
sampleID groupID A B C D E F
438 1 1 0 0 0 0 0
438 1 0 0 0 0 1 1
386 2 1 1 1 0 0 0
386 2 0 0 0 1 0 0
438 1 1 0 0 0 1 1
438 1 0 1 1 0 0 0
582 1 0 0 0 0 0 0
582 1 1 0 0 0 1 0
597 2 0 1 0 0 0 1
597 2 0 0 0 0 0 0
Notice that in column 2 (groupID), sample 386 is now 2 (for both observations).
I have searched around but haven't found anything that works the way I want. What I have now is just shuffling the second column. I tried to use dplyr as follows:
df2 <- df1 %>%
  group_by(sampleID) %>%
  mutate(groupID = sample(df1$groupID, size = 2))
But of course that only takes all the group IDs and randomly selects 2.
Any tips or suggestions would be appreciated!
One technique would be to extract the unique combinations so you have one row per sampleID, then shuffle those and merge the shuffled labels back into the main table. Here's what that would look like:
library(dplyr)
df1 %>%
  distinct(sampleID, groupID) %>%
  mutate(shuffle_groupID = sample(groupID)) %>%
  inner_join(df1)
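To actually replace the old labels rather than carry both columns, one way to finish that idea is sketched below; the toy df1 and the name shuffle_groupID are just for illustration:

```r
library(dplyr)

set.seed(1)  # for reproducibility only
# toy stand-in for df1: two rows per (sampleID, groupID) pair
df1 <- data.frame(
  sampleID = c(438, 438, 386, 386),
  groupID  = c(1, 1, 2, 2),
  A        = c(1, 0, 1, 0)
)

key <- df1 %>%
  distinct(sampleID, groupID) %>%
  mutate(shuffle_groupID = sample(groupID))  # permute labels across pairs

df2 <- df1 %>%
  inner_join(key, by = c("sampleID", "groupID")) %>%
  select(-groupID) %>%
  rename(groupID = shuffle_groupID)
```

Because the join is on the (sampleID, groupID) pair, both rows of a pair always receive the same shuffled label.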
Using dplyr nest_by and unnest:
library(dplyr)
library(tidyr)
df1 |>
  nest_by(sampleID, groupID) |>
  mutate(groupID = sample(groupID, n())) |>
  unnest(cols = c(data))
# A tibble: 10 x 3
# Groups: sampleID, groupID [4]
sampleID groupID A
<dbl> <int> <dbl>
1 386 1 1
2 386 1 0
3 438 1 0
4 438 1 0
5 438 1 0
6 438 1 1
7 582 2 0
8 582 2 0
9 597 1 1
10 597 1 0

Count the number of occurrences for each variable using dplyr

Here is my data frame (tibble) df:
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
<dbl> <dbl> <dbl> <dbl> <dbl>
1 61 0 70 0 0
2 0 0 127 0 0
3 318 0 2 0 0
4 1 0 0 0 0
5 1 0 67 0 0
6 0 0 0 139 0
7 0 0 0 0 0
8 113 0 0 0 0
9 0 0 1 0 0
10 0 0 0 1 0
For each column/variable, I would like to count the number of rows with value greater than 10. In this case, column 1 would be 3, column 2 would be zero, etc. This is a test data frame, and I would like to do this for many columns.
We can use colSums on a logical matrix
colSums(df > 10, na.rm = TRUE)
Or using dplyr
library(dplyr)
df %>%
summarise_all(~ sum(. > 10, na.rm = TRUE))
I think
library(dplyr)
df %>% summarise_all(~sum(.>10))
will do what you want.
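Note that summarise_all is superseded in dplyr 1.0+; with across() the same count can be written as below (the toy df here is my own illustration, not the question's data):

```r
library(dplyr)

# toy stand-in for df: column a has three values greater than 10, column b none
df <- data.frame(a = c(61, 0, 318, 1, 113),
                 b = c(0, 0, 0, 0, 0))

res <- df %>%
  summarise(across(everything(), ~ sum(.x > 10, na.rm = TRUE)))
res
# a = 3, b = 0
```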

Using dplyr to gather specific dummy variables

This question is an extension of "Using dplyr to gather dummy variables".
The question: how can I gather only some of the columns instead of the whole dataset? In this example I want to gather all the columns except "sedan". My real data set has 250 columns, so it would be great if I could include/exclude the columns by name.
Data set
head(type)
x convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Output
TypeOfCar
1 x
2 coupe
3 convertible
4 convertible
5 convertible
6 convertible
Not sure if I'm understanding you correctly, but you can do what you want with:
df %>% select(-sedan) %>% gather(Key, Value)
And if you have too many variables you can use:
select(-contains(""))
select(-starts_with(""))
select(-ends_with(""))
Hope it helps.
You can use -sedan in gather:
dat %>%
  gather(TypeOfCar, Count, -sedan) %>%
  filter(Count >= 1) %>%
  select(TypeOfCar)
# TypeOfCar
# 1 convertible
# 2 convertible
# 3 convertible
# 4 convertible
# 5 coupe
Data:
tt <- "convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0"
dat <- read.table(text = tt, header = T)
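gather() has since been superseded in tidyr; a pivot_longer version of the same answer might look like this (rebuilding the question's data so the block stands alone):

```r
library(dplyr)
library(tidyr)

tt <- "convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0"
dat <- read.table(text = tt, header = TRUE)

long_cars <- dat %>%
  pivot_longer(-sedan, names_to = "TypeOfCar", values_to = "Count") %>%
  filter(Count >= 1) %>%
  select(TypeOfCar)
long_cars
```

Row 1 (a sedan) drops out entirely because all of its non-sedan dummies are 0, matching the gather() behaviour.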
Fixed it with a combination of @RLave's and @Carlos Vecina's answers (my real dummy columns start with "hour"):
right_columns <- all_data %>% select(starts_with("hour"))
all_data$all_hour <- data.frame(new_column = names(right_columns)[as.matrix(right_columns) %*% seq_along(right_columns)],
                                stringsAsFactors = FALSE)
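The matrix product in that snippet is doing one-hot decoding: each row contains a single 1, so a row-wise dot product with the column indices returns the position of that 1. A small sketch on car-type data like the question's (the frame below is mine, and note the caveat about all-zero rows):

```r
# toy one-hot frame; every row has exactly one 1 (an all-zero row would
# give index 0 and silently drop elements, so guard against those)
type <- data.frame(
  convertible = c(0, 1, 1),
  coupe       = c(1, 0, 0),
  sedan       = c(0, 0, 0)
)

idx <- as.matrix(type) %*% seq_along(type)  # row-wise dot product = column of the 1
names(type)[idx]
# "coupe" "convertible" "convertible"
```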

How to compare data frame with a factor as a variable?

I have a data frame, please see below.
How do I compare the Volume where Purchase == 1 to the Volume at the previous Purchase == 1 row, and create a factor variable V1 as shown in Picture 2?
df[5, "V1"] == 1 because df[5, "Volume"] > df[3, "Volume"], and so on.
How do I achieve this without loops, in a vectorized way, so that the calculation is fast when dealing with millions of rows?
I've tried subsetting and then doing the comparison, but the result has fewer rows than df, so I cannot put the factor variable back into the data frame.
Picture 2
Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0
With data.table:
library(data.table)
data <- data.table(read.table(text=' Volume Weight Purchase V1
1 3.95670 5.27560 0 0
2 3.97110 5.29280 0 0
3 3.97200 5.29120 1 0
4 3.98640 5.31160 0 0
5 3.98880 5.31240 1 1
6 3.98700 5.31040 0 0
7 3.98370 5.31080 0 0
8 3.98580 5.31400 0 0
9 3.98670 5.31120 1 0
10 3.98460 5.29040 0 0
11 3.97710 5.28920 0 0
12 3.96720 5.26080 1 0
13 3.95190 5.26520 0 0
14 3.95160 5.26840 0 0
15 3.95340 5.26360 1 0
16 3.95370 5.23600 1 1
17 3.93450 5.23480 0 0
18 3.93480 5.23640 1 0
19 3.92760 5.23600 0 0
20 3.92820 5.22960 1 0', header=T))
data[, V1 := 0]
data[Purchase == 1, V1 := as.integer(Volume > shift(Volume)) ]
data[, V1 := as.factor(V1)]
Here, I filtered the data to the rows where Purchase == 1, then brought in the previous purchase's Volume with the shift function.
Finally, I compared Volume to the previous Volume and assigned 1 where Volume is larger.
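The same idea works in base R without data.table; a minimal sketch on a toy frame (the df below is hypothetical, shaped like the question's data, showing only the first two purchases):

```r
# toy stand-in for the question's df
df <- data.frame(
  Volume   = c(3.9567, 3.9711, 3.9720, 3.9864, 3.9888, 3.9870),
  Purchase = c(0, 0, 1, 0, 1, 0)
)

idx <- which(df$Purchase == 1)  # positions of the purchase rows
df$V1 <- 0
# every purchase except the first is compared to the previous purchase's Volume
df$V1[idx[-1]] <- as.integer(df$Volume[idx[-1]] > df$Volume[idx[-length(idx)]])
df$V1 <- factor(df$V1)
df$V1
# 0 0 0 0 1 0
```

Row 5 gets a 1 because its Volume (3.9888) exceeds the Volume at the previous purchase in row 3 (3.9720), mirroring the df[5,] example in the question.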

Aggregate R data frame over count of a field: Pivot table-like result set [duplicate]

This question already has answers here: "How do I get a contingency table?" (6 answers) and "Faster ways to calculate frequencies and cast from long to wide" (4 answers). Closed 4 years ago.
I have a data frame in the following structure
ChannelId,AuthorId
1,32
28,2393293
2,32
2,32
1,2393293
31,3
3,32
5,4
2,5
What I want is
AuthorId,1,2,3,5,28,31
4,0,0,0,1,0,0
3,0,0,0,0,0,1
5,0,1,0,0,0,0
32,1,2,1,0,0,0
2393293,1,0,0,0,1,0
Is there a way to do this?
The xtabs function can be called with a formula that specifies the margins:
xtabs(~ AuthorId + ChannelId, data = dat)
ChannelId
AuthorId 1 2 28 3 31 5
2393293 1 0 1 0 0 0
3 0 0 0 0 1 0
32 1 2 0 1 0 0
4 0 0 0 0 0 1
5 0 1 0 0 0 0
Perhaps the simplest way would be: t(table(df)):
# ChannelId
#AuthorId 1 2 3 5 28 31
# 3 0 0 0 0 0 1
# 4 0 0 0 1 0 0
# 5 0 1 0 0 0 0
# 32 1 2 1 0 0 0
# 2393293 1 0 0 0 1 0
If you want to use dplyr::count you could do:
library(dplyr)
library(tidyr)
df %>%
  count(AuthorId, ChannelId) %>%
  spread(ChannelId, n, fill = 0)
Which gives:
#Source: local data frame [5 x 7]
#Groups: AuthorId [5]
#
# AuthorId 1 2 3 5 28 31
#* <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 3 0 0 0 0 0 1
#2 4 0 0 0 1 0 0
#3 5 0 1 0 0 0 0
#4 32 1 2 1 0 0 0
#5 2393293 1 0 0 0 1 0
We can also use dcast from data.table. Convert the 'data.frame' to 'data.table' and use dcast with the fun.aggregate as length.
library(data.table)
dcast(setDT(df1), AuthorId~ChannelId, length)
# AuthorId 1 2 3 5 28 31
#1: 3 0 0 0 0 0 1
#2: 4 0 0 0 1 0 0
#3: 5 0 1 0 0 0 0
#4: 32 1 2 1 0 0 0
#5: 2393293 1 0 0 0 1 0
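Likewise, spread() is superseded in current tidyr; with pivot_wider the dplyr/tidyr route becomes the following (values_fill takes over the role of fill; the data frame is rebuilt from the question so the block stands alone):

```r
library(dplyr)
library(tidyr)

# the question's data
df <- data.frame(
  ChannelId = c(1, 28, 2, 2, 1, 31, 3, 5, 2),
  AuthorId  = c(32, 2393293, 32, 32, 2393293, 3, 32, 4, 5)
)

wide <- df %>%
  count(AuthorId, ChannelId) %>%                                 # one row per pair
  pivot_wider(names_from = ChannelId, values_from = n,
              values_fill = 0)                                   # missing pairs -> 0
wide
```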
