I have some poorly formatted data that I must work with. It contains two identifiers in the first two rows, followed by the data. The data looks like:
V1 V2 V3
1 Date 12/16/18 12/17/18
2 Equip a b
3 x1 1 2
4 x2 3 4
5 x3 5 6
I want to gather the data to make it tidy, but gathering only works when you have single column names. I've tried looking at spreading as well. The only solutions I've come up with are very hacky and don't feel right. Is there an elegant way to deal with this?
Here's what I want:
Date Equip metric value
1 12/16/18 a x1 1
2 12/16/18 a x2 3
3 12/16/18 a x3 5
4 12/17/18 b x1 2
5 12/17/18 b x2 4
6 12/17/18 b x3 6
This approach gets me close, but I don't know how to deal with the poor formatting (no header, no row names). It should be easy to gather if the data were properly formatted.
> as.data.frame(t(df))
V1 V2 V3 V4 V5
V1 Date Equip x1 x2 x3
V2 12/16/18 a 1 3 5
V3 12/17/18 b 2 4 6
And here's the dput
structure(list(V1 = c("Date", "Equip", "x1", "x2", "x3"), V2 = c("12/16/18",
"a", "1", "3", "5"), V3 = c("12/17/18", "b", "2", "4", "6")), class = "data.frame", .Names = c("V1",
"V2", "V3"), row.names = c(NA, -5L))
Thanks for posting a nicely reproducible question. Here's some gentle tidyr/dplyr massaging.
library(tidyr)
df %>%
gather(key = measure, value = value, -V1) %>%
spread(key = V1, value = value) %>%
dplyr::select(-measure) %>%
gather(key = metric, value = value, x1:x3) %>%
dplyr::arrange(Date, Equip, metric)
#> Date Equip metric value
#> 1 12/16/18 a x1 1
#> 2 12/16/18 a x2 3
#> 3 12/16/18 a x3 5
#> 4 12/17/18 b x1 2
#> 5 12/17/18 b x2 4
#> 6 12/17/18 b x3 6
Updated for tidyr v1.0.0:
This is just a little bit cleaner syntax with the pivot functions.
df %>%
pivot_longer(cols = -V1) %>%
pivot_wider(names_from = V1) %>%
pivot_longer(cols = matches("x\\d"), names_to = "metric") %>%
dplyr::select(-name)
You can use the reshape package:
library(reshape)
row.names(df) = df$V1
df$V1 = NULL
df = melt(data.frame(t(df)),id.var = c('Date','Equip'))
df[order(df$Date),]
Date Equip variable value
1 12/16/18 a x1 1
3 12/16/18 a x2 3
5 12/16/18 a x3 5
2 12/17/18 b x1 2
4 12/17/18 b x2 4
6 12/17/18 b x3 6
Here's another way starting from your approach using t(). We can replace the headers from the first row and then drop the first row, allowing just a single gather which might be more intuitive.
library(tidyverse)
df <- structure(list(V1 = c("Date", "Equip", "x1", "x2", "x3"), V2 = c(
"12/16/18",
"a", "1", "3", "5"
), V3 = c("12/17/18", "b", "2", "4", "6")), class = "data.frame", .Names = c(
"V1",
"V2", "V3"
), row.names = c(NA, -5L))
df %>%
t() %>%
`colnames<-`(.[1, ]) %>%
`[`(-1, ) %>%
as_tibble() %>%
gather("metric", "value", x1:x3) %>%
arrange(Date, Equip, metric)
#> # A tibble: 6 x 4
#> Date Equip metric value
#> <chr> <chr> <chr> <chr>
#> 1 12/16/18 a x1 1
#> 2 12/16/18 a x2 3
#> 3 12/16/18 a x3 5
#> 4 12/17/18 b x1 2
#> 5 12/17/18 b x2 4
#> 6 12/17/18 b x3 6
Created on 2018-04-20 by the reprex package (v0.2.0).
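For comparison, the same transpose-then-reshape idea can be sketched in base R without any packages. This is only a rough sketch: it assumes the identifier rows are labelled exactly Date and Equip, as in the dput above, and the names wide, metrics and long are purely illustrative.

```r
# Data from the question's dput
df <- data.frame(
  V1 = c("Date", "Equip", "x1", "x2", "x3"),
  V2 = c("12/16/18", "a", "1", "3", "5"),
  V3 = c("12/17/18", "b", "2", "4", "6"),
  stringsAsFactors = FALSE
)

# Transpose everything except the label column, then use the labels as headers
wide <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)
names(wide) <- df$V1

# Every column that is not an identifier is treated as a metric (assumption)
metrics <- setdiff(names(wide), c("Date", "Equip"))

# Build the long table by hand: identifiers repeat once per metric
long <- data.frame(
  Date   = rep(wide$Date, times = length(metrics)),
  Equip  = rep(wide$Equip, times = length(metrics)),
  metric = rep(metrics, each = nrow(wide)),
  value  = as.numeric(unlist(wide[metrics], use.names = FALSE)),
  stringsAsFactors = FALSE
)

# Sort to match the requested layout (Date is sorted as plain text here)
long <- long[order(long$Date, long$Equip, long$metric), ]
rownames(long) <- NULL
long
```

Because everything passes through t(), all values start out as character, so the sketch converts value back to numeric explicitly. Sorting Date as text happens to work for these two dates; real data would probably want as.Date() with a suitable format first.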
Is it possible to use group_by to group one variable and count the target variable based on another variable?
For example,
x1 x2 x3
A 1 0
B 2 1
C 3 0
B 1 1
A 1 1
I want to count the 0s and 1s of x3 within each group of x1:
x1 x3=0 x3=1
A 1 1
B 0 2
C 1 0
Is it possible to use group_by and add something to summarize? I tried group_by both x1 and x3, but that gives x3 as the second column which is not what we are looking for.
If it's not possible to just use group_by, I was thinking we could group_by both x1 and x3, then split by x3 and cbind them, but the two dataframes after split have different lengths of rows, and there's no cbind_fill. What should I do to cbind them and fill the extra blanks?
Using the data.table package:
library(data.table)
dat <- as.data.table(dataset)
dat[, x3:= paste0("x3=", x3)]
result <- dcast(dat, x1~x3, value.var = "x3", fun.aggregate = length)
A tidyverse approach to achieve your desired result using dplyr::count + tidyr::pivot_wider:
library(dplyr)
library(tidyr)
df %>%
count(x1, x3) %>%
pivot_wider(names_from = "x3", values_from = "n", names_prefix = "x3=", values_fill = 0)
#> # A tibble: 3 × 3
#> x1 `x3=0` `x3=1`
#> <chr> <int> <int>
#> 1 A 1 1
#> 2 B 0 2
#> 3 C 1 0
DATA
df <- data.frame(
x1 = c("A", "B", "C", "B", "A"),
x2 = c(1L, 2L, 3L, 1L, 1L),
x3 = c(0L, 1L, 0L, 1L, 1L)
)
Yes, it is possible. Here is an example:
library(dplyr)
library(tidyr)
dat = read.table(text = "x1 x2 x3
A 1 0
B 2 1
C 3 0
B 1 1
A 1 1", header = TRUE)
dat %>% group_by(x1) %>%
count(x3) %>%
pivot_wider(names_from = x3,
names_glue = "x3 = {x3}",
values_from = n) %>%
replace(is.na(.),0)
# A tibble: 3 x 3
# Groups: x1 [3]
# x1 `x3 = 0` `x3 = 1`
# <chr> <int> <int>
#1 A 1 1
#2 B 0 2
#3 C 1 0
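If no packages are wanted at all, base R's table() already produces this kind of cross-count. The following is a minimal sketch using the example data from the question; the names tab and res are illustrative only.

```r
dat <- data.frame(
  x1 = c("A", "B", "C", "B", "A"),
  x2 = c(1, 2, 3, 1, 1),
  x3 = c(0, 1, 0, 1, 1)
)

# Cross-tabulate x1 against x3; combinations that never occur are counted as 0
tab <- table(x1 = dat$x1, x3 = dat$x3)

# Turn the contingency table into an ordinary data frame, one column per x3 value
res <- as.data.frame.matrix(tab)
names(res) <- paste0("x3=", names(res))

# Move the row names (the x1 groups) into a proper column
res <- cbind(x1 = rownames(res), res, row.names = NULL)
res
```

Missing combinations come out as 0 rather than NA, so no separate fill step is needed here.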
I have two data frames, df1 and df2, that look as follows:
year <- c(2010, 2010, 2011)
week <- c(1, 2, 1)
X1 <- c(2, 8, 7)
X2 <- c(3, 6, 5)
df1 <- data.frame(year, week, X1, X2)
df1
year week X1 X2
1 2010 1 2 3
2 2010 2 8 6
3 2011 1 7 5
firm<-c("X1", "X1", "X2")
year <- c(2010,2010,2011)
week<- c(1, 2, 1)
cost<-c(10,30,20)
df2<- data.frame(firm,year, week, cost)
df2
firm year week cost
1 X1 2010 1 10
2 X1 2010 2 30
3 X2 2011 1 20
I'd like to merge these so the final result (i.e. df3) looks as follows:
df3
firm year week cost Y
1 X1 2010 1 10 2
2 X1 2010 2 30 8
3 X2 2011 1 20 5
Where "Y" is a new variable that reflects the values of X1 and X2 for a particular year and week found in df1.
Is there a way to do this in R? Thank you in advance for your reply.
We can reshape the first dataset to 'long' format and then join it with the second dataset:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = X1:X2, values_to = 'Y', names_to = 'firm') %>%
right_join(df2)
Output:
# A tibble: 3 x 5
# year week firm Y cost
# <dbl> <dbl> <chr> <int> <dbl>
#1 2010 1 X1 2 10
#2 2010 2 X1 8 30
#3 2011 1 X2 5 20
data
df1 <- structure(list(year = c(2010L, 2010L, 2011L), week = c(1L, 2L,
1L), X1 = c(2L, 8L, 7L), X2 = c(3L, 6L, 5L)), class = "data.frame",
row.names = c("1",
"2", "3"))
df2 <- structure(list(firm = c("X1", "X1", "X2"), year = c(2010, 2010,
2011), week = c(1, 2, 1), cost = c(10, 30, 20)), class = "data.frame",
row.names = c(NA,
-3L))
Here is a base R option (borrowing data from @akrun, thanks!):
q <- startsWith(names(df1),"X")
v <- cbind(df1[!q],stack(df1[q]),row.names = NULL)
df3 <- merge(setNames(v,c(names(df1)[!q],"Y","firm")),df2)
which gives
> df3
year week firm Y cost
1 2010 1 X1 2 10
2 2010 2 X1 8 30
3 2011 1 X2 5 20
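Another way to think about it, sketched below, is to skip the reshape: merge on year and week first, then pick the value out of the column named in firm with matrix indexing. This assumes every value of firm is the name of a column in df1; firm_cols, m and vals are illustrative names.

```r
df1 <- data.frame(year = c(2010, 2010, 2011), week = c(1, 2, 1),
                  X1 = c(2, 8, 7), X2 = c(3, 6, 5))
df2 <- data.frame(firm = c("X1", "X1", "X2"), year = c(2010, 2010, 2011),
                  week = c(1, 2, 1), cost = c(10, 30, 20))

# Bring the X1/X2 columns alongside each row of df2
m <- merge(df2, df1, by = c("year", "week"))

# For each row, look up the column whose name matches firm
firm_cols <- c("X1", "X2")
vals <- as.matrix(m[firm_cols])
m$Y <- vals[cbind(seq_len(nrow(m)), match(m$firm, firm_cols))]

# Keep only the requested columns, in the requested order
df3 <- m[c("firm", "year", "week", "cost", "Y")]
df3
```

The two-column index matrix passed to vals[...] selects one (row, column) cell per row, which is what makes the per-row lookup possible without a loop.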
I have a data frame like this
df <- data.frame(id = 1:4,
V1 = c("A", NA, "C", NA),
V2 = c(NA, NA, NA, "E"),
V3 = c(NA, "B", NA, "F"),
V4 = c(NA, NA, "D", NA), stringsAsFactors = FALSE)
# id V1 V2 V3 V4
# 1 1 A <NA> <NA> <NA>
# 2 2 <NA> <NA> B <NA>
# 3 3 C <NA> <NA> D
# 4 4 <NA> E F <NA>
How can I extract non-missing elements by rows and stack them into a column? My expected output is:
# id value
# 1 1 A
# 2 2 B
# 3 3 C
# 4 3 D
# 5 4 E
# 6 4 F
Try pivot_longer() or unite() + separate_rows().
library(tidyr)
library(dplyr)
# Method 1
df %>%
pivot_longer(-id, values_drop_na = TRUE) %>%
select(-name)
# Method 2
df %>%
unite(value, -id, na.rm = TRUE) %>%
separate_rows(value)
# # A tibble: 6 x 2
# id value
# <int> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 3 D
# 5 4 E
# 6 4 F
You can use dplyr and tidyr:
df %>%
tidyr::gather(-id, key = "key", value = "value") %>%
dplyr::filter(!is.na(value))
id key value
1 1 V1 A
2 3 V1 C
3 4 V2 E
4 2 V3 B
5 4 V3 F
6 3 V4 D
One base R solution could be:
na.omit(data.frame(df[1], stack(df[-1])[1]))
id values
1 1 A
3 3 C
8 4 E
10 2 B
12 4 F
15 3 D
How about combining complete.cases with the reshape2 package?
library(reshape2)
df.temp <- melt(df, id.vars = "id")
df.temp[complete.cases(df.temp),-2]
results in
id value
1 1 A
3 3 C
8 4 E
10 2 B
12 4 F
15 3 D
pivot_longer then filter
library(tidyverse)
df <- data.frame(id = 1:4,
V1 = c("A", NA, "C", NA),
V2 = c(NA, NA, NA, "E"),
V3 = c(NA, "B", NA, "F"),
V4 = c(NA, NA, "D", NA), stringsAsFactors = FALSE)
df %>% pivot_longer(-id, names_to = "name", values_to = "value") %>%
filter(!is.na(value)) %>%
select(-name)
#> # A tibble: 6 x 2
#> id value
#> <int> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 3 D
#> 5 4 E
#> 6 4 F
Created on 2020-03-02 by the reprex package (v0.3.0)
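Note that the stack() and melt() answers above return the rows in column order (V1 first, then V2, and so on), whereas the expected output is ordered by id. A rough base R sketch that keeps the row order is to transpose first, so that reading the matrix column by column walks through the original rows. The names vals, keep and out are illustrative.

```r
df <- data.frame(id = 1:4,
                 V1 = c("A", NA, "C", NA),
                 V2 = c(NA, NA, NA, "E"),
                 V3 = c(NA, "B", NA, "F"),
                 V4 = c(NA, NA, "D", NA), stringsAsFactors = FALSE)

# After transposing, each column of vals holds one original row
vals <- t(df[-1])
keep <- !is.na(vals)

# col(vals) gives the original row number of every cell
out <- data.frame(id = df$id[col(vals)[keep]],
                  value = vals[keep],
                  stringsAsFactors = FALSE)
out
```

Since R stores matrices column by column, vals[keep] visits row 1's values from V1 to V4, then row 2's, and so on, which is the order requested in the question.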
I have a problem similar to the following one, but the solution presented in the linked question does not work for me:
tidyr spread does not aggregate data
I have a df in the following structure:
UndesiredIndex DesiredIndex DesiredRows Result
1 x1A x1 A 50,32
2 x1B x2 B 7,34
3 x2A x1 A 50,33
4 x2B x2 B 7,35
Using the code below:
dftest <- bd_teste %>%
select(-UndesiredIndex) %>%
spread(DesiredIndex, Result)
I expected the following result:
DesiredIndex A B
A 50,32 50,33
B 7,34 7,35
However, I keep getting the following result:
DesiredIndex x1 x2
1 A 50.32 NA
2 B 7.34 NA
3 A NA 50.33
4 B NA 7.35
PS: Sometimes I force the column UndesiredIndex out with select(-UndesiredIndex), but I keep getting the following message:
Adding missing grouping variables: UndesiredIndex
It might be easy to stack those rows, but I'm new to R and have been trying hard to solve this without success.
Thanks in advance!
We group by DesiredIndex, create a sequence column and then do the spread:
library(tidyverse)
df1 %>%
select(-UndesiredIndex) %>%
group_by(DesiredIndex) %>%
mutate(new = LETTERS[row_number()]) %>%
ungroup %>%
select(-DesiredIndex) %>%
spread(new, Result)
# A tibble: 2 x 3
# DesiredRows A B
# <chr> <chr> <chr>
#1 A 50,32 50,33
#2 B 7,34 7,35
Data
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
Shorter, but more theoretically round-about.
Data
(Thanks to @akrun!)
df1 <- structure(
list(
UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
DesiredIndex = c("x1", "x2", "x1", "x2"),
DesiredRows = c("A", "B", "A", "B"),
Result = c("50,32", "7,34", "50,33", "7,35")
),
class = "data.frame",
row.names = c("1", "2", "3", "4")
)
This is a great technique for concatenating rows.
library(tidyverse)
df1 %>%
group_by(DesiredRows) %>%
summarise(Result = paste(Result, collapse = "|")) %>% #<Concatenate rows
separate(Result, into = c("A", "B"), sep = "\\|") #<Separate by '|'
#> # A tibble: 2 x 3
#> DesiredRows A B
#> <chr> <chr> <chr>
#> 1 A 50,32 50,33
#> 2 B 7,34 7,35
Created on 2018-08-06 by the reprex package (v0.2.0).
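The same "stack the results per DesiredRows" idea can also be sketched in base R with split() and rbind(). This is a swapped-in technique rather than what either answer does, and it assumes every DesiredRows group has the same number of results; parts, mat and out are illustrative names.

```r
df1 <- data.frame(
  UndesiredIndex = c("x1A", "x1B", "x2A", "x2B"),
  DesiredIndex = c("x1", "x2", "x1", "x2"),
  DesiredRows = c("A", "B", "A", "B"),
  Result = c("50,32", "7,34", "50,33", "7,35"),
  stringsAsFactors = FALSE
)

# One character vector of results per DesiredRows value
parts <- split(df1$Result, df1$DesiredRows)

# Bind the vectors as rows; groups of unequal length would be recycled with a warning
mat <- do.call(rbind, parts)
colnames(mat) <- LETTERS[seq_len(ncol(mat))]

out <- data.frame(DesiredRows = rownames(mat), mat,
                  row.names = NULL, stringsAsFactors = FALSE)
out
```

The Result values stay as text with comma decimals, as in the answers above; converting them to numbers would be a separate step.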
I have the following dataframe:
df <- structure(list(x1 = 2:5, x2 = c("zz", "333.iv", "333.i.v", "333(100ug)"
)), .Names = c("x1", "x2"), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
df
#> x1 x2
#> 1 2 zz
#> 2 3 333.iv
#> 3 4 333.i.v
#> 4 5 333(100ug)
For column x2, I want to replace every value containing 333 with 3-33, resulting in:
x1 x2
2 zz
3 3-33
4 3-33
5 3-33
How can I do that?
What about this:
df$x2[grepl('333', df$x2, fixed = TRUE)] <- '3-33'
# > df
# x1 x2
# 1 2 zz
# 2 3 3-33
# 3 4 3-33
# 4 5 3-33
With dplyr:
library(dplyr)
df %>%
mutate(x2 = ifelse(grepl('333', x2, fixed = TRUE), '3-33', x2))
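One thing to be aware of: grepl('333', x2, fixed = TRUE) matches 333 anywhere in the string. If only values that begin with 333 should be renamed, a sketch with base R's startsWith() might look like this. The extra value "1333" is a made-up example added purely to show the difference, and x2 and out are illustrative names.

```r
# Values from the question plus one hypothetical value ("1333") for contrast
x2 <- c("zz", "333.iv", "333.i.v", "333(100ug)", "1333")

# Rename only the values that start with "333"
out <- ifelse(startsWith(x2, "333"), "3-33", x2)
out

# grepl() with fixed = TRUE would also flag "1333", since it contains "333"
grepl("333", x2, fixed = TRUE)
```

Which of the two behaviours is right depends on the data, so this is a variation to consider rather than a correction of the answers above.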