Stacking rows as columns in R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm trying to stack rows of data into columns so that the variables in another column will repeat. I would like to turn something like this
tib <- tribble(~x, ~y, ~z, "a", 1,2, "b", 3,4)
> tib
# A tibble: 2 x 3
x y z
<chr> <dbl> <dbl>
1 a 1 2
2 b 3 4
into
t <- tribble(~X, ~Y, "a", 1, "a", 2, "b", 3, "b", 4)
> t
# A tibble: 4 x 2
X Y
<chr> <dbl>
1 a 1
2 a 2
3 b 3
4 b 4
Thanks for your help and sorry if I've missed this solution somewhere. I did a search, and tried applying gather(), spread(), but couldn't get it to work out.

Here is an example using data.table::melt():
# Assuming your data is a data.frame
xyz <- data.frame(
x = c("a", "b"),
y = c(1, 3),
z = c(2, 4)
)
library(data.table)
melt(xyz, id.vars = "x")[c(1, 3)]
x value
1 a 1
2 b 3
3 a 2
4 b 4

This can be done with many packages. One possibility is the gather() function from tidyr.
EDIT
Using @sindri_baldur's data:
library(tidyr)
xyz %>%
  gather(class, measurement, -x)
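gather() has since been superseded in tidyr by pivot_longer(). Below is a minimal sketch of the same reshape with the newer function, assuming tidyr 1.0.0 or later. The names long and result are chosen here for illustration, and the last step is only there to match the X/Y layout asked for in the question.

```r
library(tidyr)

# Same data as in the question, as a plain data.frame
xyz <- data.frame(x = c("a", "b"), y = c(1, 3), z = c(2, 4))

# Stack y and z into one value column; each x is repeated once per stacked column
long <- pivot_longer(xyz, cols = c(y, z), names_to = "name", values_to = "Y")

# Drop the helper column holding the old column names and rename x to X
result <- data.frame(X = long$x, Y = long$Y)
result
```

pivot_longer() walks through the input row by row, so the rows for "a" come before the rows for "b", which is the order shown in the question.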

Related

Create dummy variable based on conditions in multiple other columns [duplicate]

This question already has answers here:
How do I create a new column based on multiple conditions from multiple columns?
(3 answers)
Closed 7 months ago.
I am looking for help in adding a dummy variable to an existing dataframe based on conditions in multiple columns (this last bit is what separates my question from the answers I already found).
Here's a simple example:
y <- c(1,2,5,2,3,3)
z <- c("A", "B", "B", "A", "A", "B")
df <- data.frame(y, z)
Now I'd like to have a third column, which takes the value '1' if y is equal to 2 or if z is equal to B. So the column would show a value of 1 for all observations except the first (A,1) and the fifth (A,3).
I'm sure I know all the ingredients for doing this, I just cannot put it together right now. Any help would be much appreciated!
dplyr option using case_when:
y <- c(1,2,5,2,3,3)
z <- c("A", "B", "B", "A", "A", "B")
df <- data.frame(y = y, z = z)
library(dplyr)
df %>%
  mutate(dummy = case_when(y == 2 | z == "B" ~ 1,
                           TRUE ~ 0))
#> y z dummy
#> 1 1 A 0
#> 2 2 B 1
#> 3 5 B 1
#> 4 2 A 1
#> 5 3 A 0
#> 6 3 B 1
Created on 2022-07-19 by the reprex package (v2.0.1)
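If you would rather not load a package, the same condition can be written in base R. This is a small sketch rather than the answerer's method: a logical test is evaluated per row and as.integer() turns TRUE/FALSE into 1/0.

```r
y <- c(1, 2, 5, 2, 3, 3)
z <- c("A", "B", "B", "A", "A", "B")
df <- data.frame(y = y, z = z)

# TRUE where y is 2 or z is "B"; as.integer() maps TRUE/FALSE to 1/0
df$dummy <- as.integer(df$y == 2 | df$z == "B")
df
```

The single | is the vectorised "or"; || would only compare single values and is not appropriate here.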

Subset a data frame based on count of values of column x. Want only the top two in R

Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame so that I get a new data frame containing only the top two values of "x", based on the count of "x".
One of the simplest ways to achieve this is with the data.table package, which allows fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and c to p and x, respectively. This way, you won't see an NA when filtering the top two observations.
The idea is to sort your dataset and then use .SD, a special symbol that holds each group's subset of the data, to extract rows by group.
Please see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
  add_count(x) %>%
  slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
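Note that the data.table answer returns the two largest p values within every group of x, and the slice_max() call appears to return the right rows here only because "a" and "b" are tied at a count of 4. A hedged base R sketch that first finds the two most frequent values of x and then filters on them could look like this (counts, top_x and top_df are names made up for the example):

```r
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)

# Count how often each value of x occurs, most frequent first
counts <- sort(table(df$x), decreasing = TRUE)

# Names of the two most frequent values (ties are not treated specially)
top_x <- names(counts)[1:2]

# Keep only the rows whose x is one of those two values
top_df <- df[df$x %in% top_x, ]
top_df
```

If several values tie for second place, this keeps whichever one sort() happens to put first, so decide up front how ties should be handled for your data.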

How to convert a data frame in R? [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
I've got an R data frame in the form of
my.data1 = data.frame(sex = c("m", "f"),
A = c(1, 2),
B = c(3, 4))
However, I'd like my data to be in the form of
my.data2 = data.frame(value = c(1, 2, 3, 4),
group = c("A", "A", "B", "B"),
sex = c("m", "f", "m", "f"))
So basically, I wanna turn some former columns ("A" and "B") into table cells under the new column "group" and simultaneously collect all former table cells under one new column "value".
What is the easiest way to convert the data accordingly?
Thanks in advance!
A result close to what you want can be obtained with the melt() function from reshape2. You declare one variable as the id and the remaining columns are stacked:
library(reshape2)
#Data
my.data1 = data.frame(sex = c("m", "f"),
A = c(1, 2),
B = c(3, 4))
#Reshape
my.data2 <- melt(my.data1,id.vars = 'sex')
Output:
sex variable value
1 m A 1
2 f A 2
3 m B 3
4 f B 4
If you want to go further, you can use a tidyverse approach with pivot_longer(). Its cols argument selects the columns to stack, so cols = -sex stacks everything except the id column sex:
library(tidyverse)
my.data1 %>% pivot_longer(cols = -sex)
Output:
# A tibble: 4 x 3
sex name value
<fct> <chr> <dbl>
1 m A 1
2 m B 3
3 f A 2
4 f B 4
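Neither output above has exactly the column names and row order of my.data2 from the question. As a sketch under that assumption (the intermediate name stacked is made up for the example), base R's stack() gets there without extra packages:

```r
my.data1 <- data.frame(sex = c("m", "f"),
                       A = c(1, 2),
                       B = c(3, 4))

# stack() puts the selected columns on top of each other:
# 'values' holds the cells, 'ind' holds the name of the source column
stacked <- stack(my.data1[c("A", "B")])

my.data2 <- data.frame(
  value = stacked$values,
  group = as.character(stacked$ind),
  sex   = rep(my.data1$sex, times = 2)  # repeat the id column once per stacked column
)
my.data2
```

The times = 2 is tied to the two stacked columns A and B; with more columns it would need to match their number.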

How to create a rank for a variable in a longitudinal dataset based on a condition?

I have a longitudinal dataset where each subject is represented more than once. Each row represents one admission for a patient. Each admission, regardless of the subject, also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know which rows are the subsequent RE-admissions. The variable to use is "Daystoevent"; the lowest number represents the INDEX admission. I want to create a new variable such that, for each subject, the row with the lowest "Daystoevent" is labelled "index" and each subsequent admission gets a number "1", "2", etc. I want to do this WITHOUT changing into the horizontal (wide) format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
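This may not be an elegant solution, but it will do the job:
library(dplyr)
df <- df %>%
  group_by(Subject) %>%
  arrange(Subject, Daystoevent) %>%
  mutate(
    # 0 marks the index (earliest) admission, 1 marks any later admission
    Admission = if_else(Daystoevent == min(Daystoevent), 0, 1)
  ) %>%
  ungroup()
# Number the later admissions 1, 2, ... within each subject;
# rows flagged 0 start a new subject and are left untouched
for (i in 1:(nrow(df) - 1)) {
  if (df$Admission[i + 1] != 0) {
    df$Admission[i + 1] <- df$Admission[i] + 1
  }
}
# Relabel only the Admission column, so zeros elsewhere are not affected
df$Admission[df$Admission == 0] <- "index"
df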
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
# tibble() is re-exported by dplyr; data_frame() is deprecated
df <- tibble(
  Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
  Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
  Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)
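The same numbering can also be computed without a loop and without reordering the rows. The following is a sketch in base R, not the answer's method: ave() applies rank() within each subject, and subtracting 1 makes the earliest admission 0. The column name Label is an assumption added for the example.

```r
df <- data.frame(
  Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
  Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
  Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)

# Rank Daystoevent within each Subject; the smallest value gets rank 1, so
# subtracting 1 gives 0 for the index admission and 1, 2, ... afterwards
df$Admission <- ave(df$Daystoevent, df$Subject,
                    FUN = function(d) rank(d, ties.method = "first") - 1)

# Optional text label, kept in a separate column so Admission stays numeric
df$Label <- ifelse(df$Admission == 0, "index", as.character(df$Admission))
df
```

ties.method = "first" means that two admissions with the same Daystoevent for one subject are numbered in row order; whether that is acceptable depends on the data.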

Ordering a dataframe by its subsegments

My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs
in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1 3 a in other words, url #1 appears three times each with a seg = "a",
2 2 b in other words: url #2 appears twice each with a seg = "b",
3 3 c in other words: url #3 appears three times with a seg = "c",
3 2 x two times with a seg = "x", and,
3 1 y once with a seg = "y"
4 1 d etc.
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
  url.seg <- df[which(df$url == xx), ] # rows for this url and its associated segs (xx is the url value itself, not an index)
  freq.df.url <- data.frame(table(url.seg)) # summarize the frequency distribution of the segs for this url
  result <- rbind(result, freq.df.url) # append the new data.frame onto the result
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq > 0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?
In base R you can do this :
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
Another possibility:
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())
Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want
One option is to use table() with as.data.frame() and then dplyr verbs to get the data into the format the OP needs:
library(tidyverse)
table(df) %>% as.data.frame() %>%
  filter(Freq > 0) %>%
  arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>% group_by(url, seg) %>%
  summarise(freq = n()) %>%
  arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1
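For completeness, here is a base R sketch that produces the url / freq / seg column order from the question, sorted by url and then by descending frequency. It combines ideas already shown above (table() plus dropping zero counts); the name counts is made up for the example.

```r
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url, seg)

# Cross-tabulate url against seg and flatten to one row per combination
counts <- as.data.frame(table(url = df$url, seg = df$seg),
                        responseName = "freq", stringsAsFactors = FALSE)

# table() also lists combinations that never occur; drop those
counts <- counts[counts$freq > 0, ]

# The url labels come back as text, so convert them before sorting numerically
counts$url <- as.numeric(counts$url)
counts <- counts[order(counts$url, -counts$freq), c("url", "freq", "seg")]
rownames(counts) <- NULL
counts
```

Because table() builds the full url by seg grid before the zero rows are dropped, this may use a lot of memory with many thousands of distinct URLs and segs; the aggregate() and count() answers above avoid building that grid.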
