This is an example dataframe. My real dataframe is larger. I highly prefer a tidyverse solution.
#my data
age <- c(18,18,19)
A1 <- c(3,5,3)
A2 <- c(4,4,3)
B1 <- c(1,5,2)
B2 <- c(2,2,5)
df <- data.frame(age, A1, A2, B1, B2)
I want my data to look like this:
#what i want
new_age <- c(18,18,18,18,19,19)
A <- c(3,5,4,4,3,3)
B <- c(1,5,2,2,2,5)
new_df <- data.frame(new_age, A, B)
I want to pivot longer and stack columns A1:A2 into column A, and B1:B2 into B. I also want to have the responses to match the correct age. For example, the 19 year old person in this example has only responded with 3's in columns A1:A2.
tidyr::pivot_longer(df, cols = -age, names_to = c(".value",'groupid'),
#1+ non digits followed by 1+ digits
names_pattern = "(\\D+)(\\d+)")
# A tibble: 6 x 4
age groupid A B
<dbl> <chr> <dbl> <dbl>
1 18 1 3 1
2 18 2 4 2
3 18 1 5 5
4 18 2 4 2
5 19 1 3 2
6 19 2 3 5
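To see what names_pattern does, the two capture groups can be applied to the column names by hand. The following is only a sketch in base R; nms, pattern and parts are names made up for illustration. The first group becomes the ".value" part (the name of the output column) and the second group becomes groupid.

```r
# Column names from the example data, excluding the id column `age`
nms <- c("A1", "A2", "B1", "B2")
pattern <- "(\\D+)(\\d+)"

# Extract each capture group separately with a backreference
parts <- data.frame(
  name    = nms,
  value   = sub(pattern, "\\1", nms),  # non-digit prefix -> output column
  groupid = sub(pattern, "\\2", nms)   # trailing digits  -> groupid
)
parts
```

Columns that share the same prefix (A1 and A2) end up stacked in one output column, which is why the result has one A and one B column.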
In base R, you can use reshape and then select the columns you want. You can also change the row names.
reshape(df, varying = 2:ncol(df), direction = "long", sep = "")[, -c(2, 5)] # drop the time and id columns
age A B
1.1 18 3 1
2.1 18 5 5
3.1 19 3 2
1.2 18 4 2
2.2 18 4 2
3.2 19 3 5
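Resetting the row names could look like the sketch below. It rebuilds the example data and assigns NULL to rownames(), which restores the default 1..n numbering. The name res is only illustrative, and the columns are selected by name rather than by position.

```r
df <- data.frame(age = c(18, 18, 19), A1 = c(3, 5, 3), A2 = c(4, 4, 3),
                 B1 = c(1, 5, 2), B2 = c(2, 2, 5))

# Reshape to long format, then keep only the columns of interest
res <- reshape(df, varying = 2:ncol(df), direction = "long", sep = "")
res <- res[, c("age", "A", "B")]

# Replace the "1.1", "2.1", ... row names with the default numbering
rownames(res) <- NULL
res
```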
As you have a larger dataframe, a data.table solution may be faster. Here, you can use the melt function from the data.table package as follows:
library(data.table)
colA = grep("A",colnames(df),value = TRUE)
colB = grep("B",colnames(df),value = TRUE)
setDT(df)
dt <- melt(df, measure.vars = list(colA, colB), value.name = c("A", "B"))
dt[, variable := NULL]
dt <- dt[order(age)]
dt
age A B
1: 18 3 1
2: 18 5 5
3: 18 4 2
4: 18 4 2
5: 19 3 2
6: 19 3 5
Does this answer your question?
EDIT: Using patterns - suggestion from @Wimpel
As @Wimpel suggested in the comments, you can get the same result using patterns:
melt( setDT(df), measure.vars = patterns( A="^A[0-9]", B="^B[0-9]") )[, variable:=NULL][]
age A B
1: 18 3 1
2: 18 5 5
3: 19 3 2
4: 18 4 2
5: 18 4 2
6: 19 3 5
I have missing categorical variables in a list. I would like to add all the combinations of these classifications to the data frame using complete. I can do this for a single variable using mutate.
Simplified example:
library(tidyverse)
df <- tibble(a1 = 1:6,
b1 = rep(c(1,2),3),
c1 = rep(c(1:3), 2))
missing_cols <- list(d1 = c(7:8),
e1 = c(12:14))
# Use the first classification of d1 for mutate and complete with all classifications
df %>%
mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
complete(nesting(a1, b1,c1), d1 = missing_cols[[1]])
Desired output
df %>%
mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
mutate(!!names(missing_cols)[2] := missing_cols[[2]][1]) %>%
complete(nesting(a1, b1,c1), d1 = missing_cols[[1]], e1 = missing_cols[[2]])
This will get the correct output for d1. How can I do this for all variables in my list?
We can use crossing from tidyr with cross_df from purrr:
library(tidyr)
library(purrr)
crossing(df, cross_df(missing_cols))
# a1 b1 c1 d1 e1
# <int> <dbl> <int> <int> <int>
# 1 1 1 1 7 12
# 2 1 1 1 7 13
# 3 1 1 1 7 14
# 4 1 1 1 8 12
# 5 1 1 1 8 13
# 6 1 1 1 8 14
# 7 2 2 2 7 12
# 8 2 2 2 7 13
# 9 2 2 2 7 14
#10 2 2 2 8 12
# … with 26 more rows
cross_df creates all possible combinations of the elements of missing_cols, while crossing takes that output and creates all possible combinations with the rows of df.
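As far as I know, cross_df has been deprecated in purrr 1.0.0 and later in favour of tidyr::expand_grid, so a sketch that avoids it may be useful. It assumes a reasonably recent tidyr, where expand_grid accepts data frames and a named list spliced in with !!!. The name res is only illustrative.

```r
library(tidyr)

df <- data.frame(a1 = 1:6, b1 = rep(c(1, 2), 3), c1 = rep(1:3, 2))
missing_cols <- list(d1 = 7:8, e1 = 12:14)

# Splice the named list into expand_grid(); every row of df is combined
# with every combination of d1 and e1 (6 * 2 * 3 = 36 rows)
res <- expand_grid(df, !!!missing_cols)
res
```

Unlike crossing, expand_grid neither sorts nor de-duplicates its inputs, which makes no difference here because df is already sorted and has no duplicate rows.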
Using expand.grid
library(tidyr)
crossing(df, expand.grid(missing_cols))
I have two different columns for several samples, which are connected. I want to merge all columns of type 1 to one column and all of type 2 to one column, but the rows should stay connected.
Example:
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(1, 4, 9, 16, 25)
a2 <- c(2, 4, 6, 8, 10)
b2 <- c(4, 8, 12, 16, 20)
df1 <- data.frame(a1, b1, a2, b2)
a1 b1 a2 b2
1 1 1 2 4
2 2 4 4 8
3 3 9 6 12
4 4 16 8 16
5 5 25 10 20
I want to have it like this:
a b
1 1 1
2 2 4
3 2 4
4 3 9
5 4 8
6 4 16
7 5 25
8 6 12
9 8 16
10 10 20
My case
This is the example in my case. I have a lot of columns with different names and I want to extract abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new data frame, with all abs_dist in one column and all mean_vel in one column, but still connected.
I tried with unlist, but then of course the connection gets lost.
Thanks in advance.
A base R option using reshape
subset(
reshape(
setNames(df1, gsub("(\\d)", ".\\1", names(df1))),
direction = "long",
varying = 1:ncol(df1)
),
select = -c(time, id)
)
gives
a b
1.1 1 1
2.1 2 4
3.1 3 9
4.1 4 16
5.1 5 25
1.2 2 4
2.2 4 8
3.2 6 12
4.2 8 16
5.2 10 20
An option with pivot_longer from tidyr by specifying the names_sep as a regex lookaround to match between a lower case letter ([a-z]) and a digit in the column names
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
select(-grp)
-output
# A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 1 1
# 2 2 4
# 3 2 4
# 4 4 8
# 5 3 9
# 6 6 12
# 7 4 16
# 8 8 16
# 9 5 25
#10 10 20
With the edited post, we need to change the names_sep i.e. the delimiter is now _ between a lower case letter and a digit
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])_(?=[0-9])") %>%
select(-grp)
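The production data is not shown in the post, so the following sketch uses a made-up data frame (dat, with a hypothetical extra column trial) to illustrate the edited case. It assumes that the columns of interest are named abs_dist_1, mean_vel_1, and so on, and selects only those with matches() instead of everything(), so that unrelated columns do not interfere.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the production data
dat <- data.frame(
  trial      = c("t1", "t2"),
  abs_dist_1 = c(1, 2), mean_vel_1 = c(0.5, 0.7),
  abs_dist_2 = c(3, 4), mean_vel_2 = c(0.9, 1.1)
)

res <- dat %>%
  # reshape only the abs_dist_* and mean_vel_* columns
  pivot_longer(cols = matches("^(abs_dist|mean_vel)_[0-9]+$"),
               names_to = c(".value", "grp"),
               names_sep = "(?<=[a-z])_(?=[0-9])") %>%
  # keep the two stacked columns, dropping trial and grp
  select(abs_dist, mean_vel)
res
```

The lookarounds make sure that only the underscore directly before the number is used as the separator, so the underscore inside abs_dist is left alone.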
or with base R, use split.default on the substring of column names into a list of data.frame, then unlist each list element by looping over the list and convert to data.frame
data.frame(lapply(split.default(df1, sub("\\d+", "", names(df1))),
unlist, use.names = FALSE))
For the sake of completeness, here is a solution which uses data.table::melt() and the patterns() function to specify columns which belong together:
library(data.table)
melt(setDT(df1), measure.vars = patterns(a = "a", b = "b"))[
order(a,b), !"variable"]
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
This reproduces the expected result for OP's sample dataset.
A more realistic example: reshape only selected columns
With the edit of the question, the OP has clarified that the production data contains many more columns than those which need to be reshaped:
I have a lot of columns with different names and I want to extract
abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new
data frame, with all abs_dist in one column and all mean_vel in one
column, but still connected.
So, the OP wants to extract and reshape the columns of interest in one go while ignoring all other data in the dataset.
To simulate this situation, we need a more elaborate dataset which includes other columns as well:
df2 <- cbind(df1, c1 = 11:15, c2 = 21:25)
df2
a1 b1 a2 b2 c1 c2
1 1 1 2 4 11 21
2 2 4 4 8 12 22
3 3 9 6 12 13 23
4 4 16 8 16 14 24
5 5 25 10 20 15 25
With a modified version of the code above
library(data.table)
cols <- c("a", "b")
result <- melt(setDT(df2), measure.vars = patterns(cols), value.name = cols)[, ..cols]
setorderv(result, cols)
result
we get
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
For the production dataset as pictured in the edit, the OP needs to set
cols <- c("abs_dist", "mean_vel")
I would like to create a new data frame from my existing data frame "ab". The new data frame should look like "Newdf".
a<- c(1:5)
b<-c(11:15)
ab<-data.frame(C1=a,c2=b)
ab
df<-c(1,11,2,12,3,13,4,14,5,15)
CMT<-c(1:2)
CMT1<-rep.int(CMT,times=5)
Newdf<-data.frame(DV=df,Comp=CMT1)
Newdf
Can we use dplyr package? If yes, how?
More important than dplyr, you'd need tidyr:
library(tidyr)
library(dplyr)
ab %>%
gather(Comp, DV) %>%
mutate(Comp = recode(Comp, "C1" = 1, "c2" = 2))
# Comp DV
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 11
# 7 2 12
# 8 2 13
# 9 2 14
# 10 2 15
Using dplyr and tidyr gives you something close...
library(tidyr)
library(dplyr)
df2 <- ab %>%
mutate(Order=1:n()) %>%
gather(key=Comp,value=DV,C1,c2) %>%
arrange(Order) %>%
mutate(Comp=recode(Comp,"C1"=1,"c2"=2)) %>%
select(DV,Comp)
df2
DV Comp
1 1 1
2 11 2
3 2 1
4 12 2
5 3 1
6 13 2
7 4 1
8 14 2
9 5 1
10 15 2
Although the OP has asked for a dplyr solution, I felt challenged to look for a data.table solution. So, FWIW, here is an alternative approach using melt().
Note that this solution does not depend on specific column names in ab as the two other dplyr solutions do. In addition, it should be working for more than two columns in ab as well (untested).
library(data.table)
melt(setDT(ab, keep.rownames = TRUE), id.vars = "rn", value.name = "DV"
)[, Comp := rleid(variable)
][order(rn)][, c("rn", "variable") := NULL][]
# DV Comp
# 1: 1 1
# 2: 11 2
# 3: 2 1
# 4: 12 2
# 5: 3 1
# 6: 13 2
# 7: 4 1
# 8: 14 2
# 9: 5 1
#10: 15 2
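The claim about more than two columns can be checked with a small sketch. The data frame ab3 and its third column c3 are made up for illustration. One deviation from the code above: rn is a character column, so the sketch orders by as.integer(rn) to avoid row "10" sorting before row "2" in larger inputs.

```r
library(data.table)

# Hypothetical input with a third column, c3
ab3 <- data.frame(C1 = 1:3, c2 = 11:13, c3 = 21:23)

res <- melt(setDT(ab3, keep.rownames = TRUE), id.vars = "rn", value.name = "DV"
  )[, Comp := rleid(variable)         # number the source columns 1, 2, 3
  ][order(as.integer(rn))             # interleave by original row
  ][, c("rn", "variable") := NULL][]
res
```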
Data
ab <- structure(list(C1 = 1:5, c2 = 11:15), .Names = c("C1", "c2"),
row.names = c(NA, -5L), class = "data.frame")
ab
# C1 c2
#1 1 11
#2 2 12
#3 3 13
#4 4 14
#5 5 15
How can we generate unique id numbers within each group of a dataframe? Here's some data grouped by "personid":
personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23
I wish to add an id column with a unique value for each row within each subset defined by "personid", always starting with 1. This is my desired output:
personid date measurement id
1 x 23 1
1 x 32 2
2 y 21 1
3 x 23 1
3 z 23 2
3 y 23 3
I appreciate any help.
Some dplyr alternatives, using convenience functions row_number and n.
library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
You may also use getanID from package splitstackshape. Note that the input dataset is returned as a data.table.
library(splitstackshape)
getanID(data = df, id.vars = "personid")
# personid date measurement .id
# 1: 1 x 23 1
# 2: 1 x 32 2
# 3: 2 y 21 1
# 4: 3 x 23 1
# 5: 3 z 23 2
# 6: 3 y 23 3
The misleadingly named ave() function, with argument FUN=seq_along, will accomplish this nicely -- even if your personid column is not strictly ordered.
df <- read.table(text = "personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23", header=TRUE)
## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3
## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
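Using data.table, and assuming you wish to order by date within the personid subset
library(data.table)
DT <- data.table(Data)
# rank() numbers the rows by date; order() would return sort indices, which are not ranks in general
DT[, id := rank(date, ties.method = "first"), by = personid]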
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 3
## 6: 3 y 23 2
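When numbering rows by date, note that order() returns the positions that would sort the vector, whereas rank() returns the rank of each element. The two happen to coincide for the sample data but differ in general, as this small sketch with a made-up vector shows:

```r
dates <- c("z", "x", "y")

ord <- order(dates)                        # positions of the sorted elements
rnk <- rank(dates, ties.method = "first")  # rank of each element

ord  # 2 3 1
rnk  # 3 1 2
```

With ties.method = "first", tied dates are numbered in their order of appearance, so the ids within a group stay unique.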
If you do not wish to order by date
DT[, id := 1:.N, by = personid]
## personid date measurement id
## 1: 1 x 23 1
## 2: 1 x 32 2
## 3: 2 y 21 1
## 4: 3 x 23 1
## 5: 3 z 23 2
## 6: 3 y 23 3
Any of the following would also work
DT[, id := seq_along(measurement), by = personid]
DT[, id := seq_along(date), by = personid]
The equivalent commands using plyr
library(plyr)
# ordering by date
ddply(Data, .(personid), mutate, id = rank(date, ties.method = "first"))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))
I think there's a canned command for this, but I can't remember it. So here's one way:
> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
[1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
[1] 1 1 2 2 3 4 5 6 7 8
This works because duplicated returns a logical vector. cumsum operates on numeric vectors, so the logical values get coerced to numeric (FALSE becomes 0, TRUE becomes 1).
You can store the result to your data.frame as a new column if you want:
dat$id <- cumsum(duplicated(test))+1
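With a fixed input instead of sample(), the coercion can be followed step by step. The vector below is made up for illustration. For comparison, the sketch also computes a counter that restarts for each distinct value via ave(), which is a different quantity from the running count of duplicates.

```r
test <- c("a", "b", "a", "c", "b", "a")

dup <- duplicated(test)  # FALSE FALSE TRUE FALSE TRUE TRUE
running <- cumsum(dup) + 1

# Counter that restarts at 1 for each distinct value of `test`
within_group <- ave(seq_along(test), test, FUN = seq_along)

running       # 1 1 2 2 3 4
within_group  # 1 1 2 1 2 3
```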
Assuming your data are in a data.frame named Data, this will do the trick:
# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() calculates the number of each personid
# sequence() creates a n-length vector for each element in the input,
# and concatenates the result
Data$id <- sequence(tabulate(Data$personid))
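The two steps can be traced on the personid values from the question. This is only a sketch; it assumes that personid holds positive integers and is already sorted, which is what tabulate() and the ordering step above rely on.

```r
personid <- c(1, 1, 2, 3, 3, 3)

counts <- tabulate(personid)  # occurrences of ids 1, 2, 3
id <- sequence(counts)        # 1..n for each count, concatenated

counts  # 2 1 3
id      # 1 2 1 1 2 3
```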
You can use sqldf
df<-read.table(header=T,text="personid date measurement
1 x 23
1 x 32
2 y 21
3 x 23
3 z 23
3 y 23")
library(sqldf)
sqldf("SELECT a.*, COUNT(*) count
FROM df a, df b
WHERE a.personid = b.personid AND b.ROWID <= a.ROWID
GROUP BY a.ROWID"
)
# personid date measurement count
#1 1 x 23 1
#2 1 x 32 2
#3 2 y 21 1
#4 3 x 23 1
#5 3 z 23 2
#6 3 y 23 3
I have a data frame with 2 columns like this:
cond val
1 5
2 18
2 18
2 18
3 30
3 30
I want to change values in val in this way:
cond val
1 5 # 5 = 5/1 (only "1" in cond column)
2 6 # 6 = 18/3 (there are three "2" in cond column)
2 6
2 6
3 15 # 15 = 30/2
3 15
How to achieve this?
A base R solution:
# method 1:
mydf$val <- ave(mydf$val, mydf$cond, FUN = function(x) x / length(x))
# method 2:
mydf <- transform(mydf, val = ave(val, cond, FUN = function(x) x / length(x)))
which gives:
cond val
1 1 5
2 2 6
3 2 6
4 2 6
5 3 15
6 3 15
Here's the dplyr way:
library(dplyr)
df %>%
group_by(cond) %>%
mutate(val = val / n())
Which gives:
#Source: local data frame [6 x 2]
#Groups: cond [3]
#
# cond val
# (int) (dbl)
#1 1 5
#2 2 6
#3 2 6
#4 2 6
#5 3 15
#6 3 15
The idea is to divide val by the number of observations in the current group (cond) using n().
This seems like an appropriate situation for data.table:
library(data.table)
(dt <- data.table(df)[,val := val / .N, by = cond][])
# cond val
# 1: 1 5
# 2: 2 6
# 3: 2 6
# 4: 2 6
# 5: 3 15
# 6: 3 15
df <- read.table(
text = "cond val
1 5
2 18
2 18
2 18
3 30
3 30",
header = TRUE,
colClasses = "numeric"
)
In base R
df$result = df$val / ave(df$cond, df$cond, FUN = length)
The ave() divides up the cond column by its unique values and takes the length of each subvector, i.e., the denominator you ask for.
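Looking at the intermediate vector makes this easier to follow. The sketch below rebuilds the example data; group_size is a name introduced only for illustration.

```r
df <- data.frame(cond = c(1, 2, 2, 2, 3, 3),
                 val  = c(5, 18, 18, 18, 30, 30))

# For every row, the number of rows sharing its cond value
group_size <- ave(df$cond, df$cond, FUN = length)
df$result <- df$val / group_size

group_size  # 1 3 3 3 2 2
df$result   # 5 6 6 6 15 15
```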
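Here is a base R answer that will work if cond is an ID variable whose equal values sit in consecutive rows (rle counts runs of consecutive values, not total occurrences):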
# get length of repeats
temp <- rle(df$cond)
temp <- data.frame(cond=temp$values, lengths=temp$lengths)
# merge onto data.frame
df <- merge(df, temp, by="cond")
df$valNew <- df$val / df$lengths
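Because rle() counts runs of consecutive equal values, its lengths match the group sizes only when equal cond values are adjacent. A short sketch with made-up vectors shows the difference, with table() as a comparison that counts total occurrences:

```r
sorted   <- c(1, 2, 2, 2, 3, 3)
unsorted <- c(1, 2, 1)

run_sorted   <- rle(sorted)$lengths          # one run per distinct value
run_unsorted <- rle(unsorted)$lengths        # value 1 is split into two runs
tab_unsorted <- as.vector(table(unsorted))   # total occurrences per value

run_sorted    # 1 3 2
run_unsorted  # 1 1 1
tab_unsorted  # 2 1
```

If the data are not sorted by cond, sorting first or counting with table() avoids the split runs.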