R spread dataframe [duplicate] - r

In R, how can I convert
data1 into data2?
data1 = fread("
id year cost pf loss
A 2019-02 155 10 41
B 2019-03 165 14 22
B 2019-01 185 34 56
C 2019-02 350 50 0
A 2019-01 310 40 99")
data2 = fread("
id item 2019-01 2019-02 2019-03
A cost 310 155 NA
A pf 40 10 NA
A loss 99 41 NA
B cost 185 NA 165
B pf 34 NA 14
B loss 56 NA 22
C cost NA 350 NA
C pf NA 50 NA
C loss NA 0 NA")
I tried spread, gather, dplyr, and apply, but could not get the result I want.

First get the data in long format and then get it back in wide.
library(tidyr)
data1 %>%
pivot_longer(cols = cost:loss) %>%
pivot_wider(names_from = year, values_from = value)
Note that gather and spread have been retired and replaced by pivot_longer and pivot_wider.
Using data.table :
library(data.table)
dcast(melt(data1, c('id', 'year')), id+variable~year, value.var = 'value')
# id variable 2019-01 2019-02 2019-03
#1: A cost 310 155 NA
#2: A pf 40 10 NA
#3: A loss 99 41 NA
#4: B cost 185 NA 165
#5: B pf 34 NA 14
#6: B loss 56 NA 22
#7: C cost NA 350 NA
#8: C pf NA 50 NA
#9: C loss NA 0 NA
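The same long-then-wide idea can be sketched in base R only, in case neither tidyr nor data.table is available. This is just an illustration: data1 is rebuilt here with data.frame instead of fread, and the column names item and value are my own choice, not something from the question.

```r
# Rebuild the example data without data.table (assumption: plain data.frame)
data1 <- data.frame(
  id   = c("A", "B", "B", "C", "A"),
  year = c("2019-02", "2019-03", "2019-01", "2019-02", "2019-01"),
  cost = c(155, 165, 185, 350, 310),
  pf   = c(10, 14, 34, 50, 40),
  loss = c(41, 22, 56, 0, 99)
)

value_cols <- c("cost", "pf", "loss")

# Step 1: long format, one row per id/year/item
long <- data.frame(
  id    = rep(data1$id, times = length(value_cols)),
  year  = rep(data1$year, times = length(value_cols)),
  item  = rep(value_cols, each = nrow(data1)),
  value = unlist(data1[value_cols], use.names = FALSE)
)

# Step 2: back to wide, one column per year
wide <- reshape(long, idvar = c("id", "item"), timevar = "year",
                direction = "wide")

# reshape() prefixes the new columns with "value."; strip it and sort the years
names(wide) <- sub("^value\\.", "", names(wide))
year_cols <- sort(setdiff(names(wide), c("id", "item")))
wide <- wide[c("id", "item", year_cols)]
wide
```

Combinations that never occur in data1 (for example B in 2019-02) come out as NA, as in the data.table result above.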

Related

R: Pivot_Wider/spread by obtaining average sorted by year

I have the following dataset:
Pet Shop  Year  Item    Price
A         2021  dog     300
A         2021  dog     250
A         2021  fish    20
A         2020  turtle  50
A         2020  dog     250
A         2020  cat     280
A         2019  rabbit  180
A         2019  cat     165
A         2019  cat     270
B         2021  dog     350
B         2021  fish    80
B         2021  fish    70
B         2020  cat     220
B         2020  turtle  90
B         2020  turtle  80
B         2020  fish    55
B         2019  fish    75
C         2021  dog     280
C         2020  cat     260
C         2020  cat     270
C         2019  fish    65
C         2019  cat     270
The code for the data is as follows
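Pet_Shop = c(rep("A",9), rep("B",8), rep("C",5))
# Year was missing from the original code; values are taken from the table above
Year = rep(c(2021L,2020L,2019L,2021L,2020L,2019L,2021L,2020L,2019L), times = c(3,3,3,3,4,1,1,2,2))
Item = c("dog","dog","fish","turtle","dog","cat","rabbit","cat","cat","dog","fish","fish","cat","turtle","turtle","fish","fish","dog","cat","cat","fish","cat")
Price = c(300,250,20,50,250,280,180,165,270,350,80,70,220,90,80,55,75,280,260,270,65,270)
Data = data.frame(Pet_Shop, Year, Item, Price)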
Does anyone know how I can use pivot_wider or spread (or any other method) to produce the following table? It should have one row per shop and year, and each cell should hold the average price of that item in that shop for that year. I'm having trouble incorporating the year.
Pet Shop  Year  dog                     fish  turtle  cat    rabbit
A         2021  Average(300,250) = 275  20    NA      NA     NA
A         2020  250                     NA    50      280    NA
A         2019  NA                      NA    NA      217.5  180
B         2021  350                     75    NA      NA     NA
B         2020  NA                      55    85      220    NA
B         2019  NA                      75    NA      NA     NA
C         2021  280                     NA    NA      NA     NA
C         2020  NA                      NA    NA      265    NA
C         2019  NA                      65    NA      270    NA
In pivot_wider you may pass a function (values_fn) that is applied whenever several rows fall into the same output cell, i.e. share the same Pet_Shop, Year and Item.
result <- tidyr::pivot_wider(Data, names_from = Item,
values_from = Price, values_fn = mean)
result
# Pet_Shop Year dog fish turtle cat rabbit
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2021 275 20 NA NA NA
#2 A 2020 250 NA 50 280 NA
#3 A 2019 NA NA NA 218. 180
#4 B 2021 350 75 NA NA NA
#5 B 2020 NA 55 85 220 NA
#6 B 2019 NA 75 NA NA NA
#7 C 2021 280 NA NA NA NA
#8 C 2020 NA NA NA 265 NA
#9 C 2019 NA 65 NA 270 NA
The same can also be done with data.table dcast:
library(data.table)
dcast(setDT(Data), Pet_Shop + Year ~ Item,
value.var = "Price", fun.aggregate = mean)
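For comparison, here is a rough base R sketch of the same aggregate-while-pivoting step using tapply. It returns a matrix rather than a data frame, with shop and year pasted into one row label, so treat it as a quick cross-check and not a drop-in replacement. The Year column is an assumption reconstructed from the table in the question, and the names shop_year and avg are made up for this example.

```r
Data <- data.frame(
  Pet_Shop = c(rep("A", 9), rep("B", 8), rep("C", 5)),
  # Year is reconstructed from the table in the question (assumption)
  Year = rep(c(2021, 2020, 2019, 2021, 2020, 2019, 2021, 2020, 2019),
             times = c(3, 3, 3, 3, 4, 1, 1, 2, 2)),
  Item = c("dog", "dog", "fish", "turtle", "dog", "cat", "rabbit", "cat",
           "cat", "dog", "fish", "fish", "cat", "turtle", "turtle", "fish",
           "fish", "dog", "cat", "cat", "fish", "cat"),
  Price = c(300, 250, 20, 50, 250, 280, 180, 165, 270, 350, 80, 70, 220,
            90, 80, 55, 75, 280, 260, 270, 65, 270)
)

# One row label per shop/year combination
shop_year <- paste(Data$Pet_Shop, Data$Year)

# Mean price per (shop/year, item) cell; empty cells become NA
avg <- tapply(Data$Price, list(shop_year, Data$Item), mean)
avg
```

Because mean is only called on cells that actually contain rows, combinations with no sales stay NA instead of becoming NaN.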

Subset a data set based on first non-NA values

I have a data frame like this:
Year S1 S2 S3
1699 1 NA NA
1700 5 23 5
1701 6 1 6
1702 7 13 9
I want to keep only those columns whose first non-NA value occurs in a year greater than or equal to 1700. In this case, I want to keep columns S2 and S3 but not S1 (since its first non-NA year is 1699).
How can I do this?
You can use Filter :
result <- cbind(df1[1], Filter(function(x)
df1$Year[which.max(!is.na(x))] >= 1700, df1[-1]))
result
# Year S2 S3
#1 1699 NA NA
#2 1700 23 5
#3 1701 1 6
#4 1702 13 9
Using sapply to find the first non-NA year of each column:
d[c(TRUE, sapply(d[-1], function(x) d$Year[!is.na(x)][1]) >= 1700)]
# Year S2 S3
# 1 1699 NA NA
# 2 1700 23 5
# 3 1701 1 6
# 4 1702 13 9
Data
d <- read.table(header=TRUE, text="Year S1 S2 S3
1699 1 NA NA
1700 5 23 5
1701 6 1 6
1702 7 13 9")
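To make the logic easier to follow, here is the same idea split into steps. The extra all-NA column S4 is hypothetical and only added to show an edge case: a column with no observations has no first year at all, so the comparison gives NA and has to be handled explicitly (which.max on an all-FALSE vector would silently return position 1 instead).

```r
d <- read.table(header = TRUE, text = "Year S1 S2 S3
1699 1 NA NA
1700 5 23 5
1701 6 1 6
1702 7 13 9")

# Hypothetical column with no observations at all, to show the edge case
d$S4 <- NA

# Year of the first non-NA value in each data column (NA if there is none)
first_year <- sapply(d[-1], function(x) d$Year[!is.na(x)][1])
first_year

# Keep a column only if it has a first year and that year is >= 1700
keep <- !is.na(first_year) & first_year >= 1700

# Always keep the Year column itself
result <- d[c(TRUE, keep)]
result
```

Whether an all-NA column should be dropped or kept is a judgment call; here it is dropped.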

How to split a data set with duplicated information based on date

I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
I would like to split the dataset into two parts, like this:
1-
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2- Lowest Date
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2=dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs for which date is NA, then subset based on them:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column which has TRUE/FALSE for each ID based on presence of NA and then use group_split to split into list of two.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NA_ID = any(is.na(date))) %>%
ungroup %>%
group_split(NA_ID, .keep = FALSE)
The above dplyr logic can also be implemented in base R by using ave and split
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)
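A small self-contained variant of the base R approach, in case named list elements are easier to work with than "FALSE" and "TRUE". The data frame is rebuilt from the question, and the labels complete and incomplete are my own naming.

```r
df <- data.frame(
  ID = rep(1:5, each = 2),
  date = as.Date(c("2014-12-02", "2014-10-02", "2014-11-03", "2014-09-03",
                   "2014-07-11", NA, "2014-10-05", "2014-08-09", NA, NA)),
  Weight = c(23, 25, 27, 45, 56, 34, 25, 14, NA, NA)
)

# TRUE for every row whose ID has at least one missing date
has_na <- as.logical(ave(is.na(df$date), df$ID, FUN = any))

# Split into two data frames with descriptive names
parts <- split(df, ifelse(has_na, "incomplete", "complete"))

parts$complete    # IDs 1, 2 and 4
parts$incomplete  # IDs 3 and 5
```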

Cleaning a data.frame in a semi-reshape/semi-aggregate fashion

First time posting something here, forgive any missteps in my question.
In my example below I've got a data.frame where the unique identifier is the tripID with the name of the vessel, the species code, and a catch metric.
> testFrame1 <- data.frame('tripID' = c(1,1,2,2,3,4,5),
'name' = c('SS Anne','SS Anne', 'HMS Endurance', 'HMS Endurance','Salty Hippo', 'Seagallop', 'Borealis'),
'SPP' = c(101,201,101,201,102,102,103),
'kept' = c(12, 22, 14, 24, 16, 18, 10))
> testFrame1
tripID name SPP kept
1 1 SS Anne 101 12
2 1 SS Anne 201 22
3 2 HMS Endurance 101 14
4 2 HMS Endurance 201 24
5 3 Salty Hippo 102 16
6 4 Seagallop 102 18
7 5 Borealis 103 10
I need a way to condense the data.frame so that there is only one row per tripID, as shown below.
> testFrame1
tripID name SPP kept SPP.1 kept.1
1 1 SS Anne 101 12 201 22
2 2 HMS Endurance 101 14 201 24
3 3 Salty Hippo 102 16 NA NA
4 4 Seagallop 102 18 NA NA
5 5 Borealis 103 10 NA NA
I've looked into tidyr and reshape, but neither of them can deliver quite what I'm asking for. Is there anything out there that does this quasi-reshaping?
Here are two alternatives using base::reshape and data.table::dcast:
1) base R
reshape(transform(testFrame1,
timevar = ave(tripID, tripID, FUN = seq_along)),
idvar = c("tripID", "name"),
timevar = "timevar",
direction = "wide")
# tripID name SPP.1 kept.1 SPP.2 kept.2
#1 1 SS Anne 101 12 201 22
#3 2 HMS Endurance 101 14 201 24
#5 3 Salty Hippo 102 16 NA NA
#6 4 Seagallop 102 18 NA NA
#7 5 Borealis 103 10 NA NA
2) data.table
library(data.table)
setDT(testFrame1)
dcast(testFrame1, tripID + name ~ rowid(tripID), value.var = c("SPP", "kept"))
# tripID name SPP_1 SPP_2 kept_1 kept_2
#1: 1 SS Anne 101 201 12 22
#2: 2 HMS Endurance 101 201 14 24
#3: 3 Salty Hippo 102 NA 16 NA
#4: 4 Seagallop 102 NA 18 NA
#5: 5 Borealis 103 NA 10 NA
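Both alternatives rely on the same trick: number the rows within each tripID, then use that counter as the "time" variable when going wide. A minimal sketch that makes the counter visible (the column name obs is just an illustrative choice):

```r
testFrame1 <- data.frame(
  tripID = c(1, 1, 2, 2, 3, 4, 5),
  name = c("SS Anne", "SS Anne", "HMS Endurance", "HMS Endurance",
           "Salty Hippo", "Seagallop", "Borealis"),
  SPP = c(101, 201, 101, 201, 102, 102, 103),
  kept = c(12, 22, 14, 24, 16, 18, 10)
)

# Running counter within each trip: 1, 2 for trips with two species, else 1
testFrame1$obs <- ave(testFrame1$tripID, testFrame1$tripID, FUN = seq_along)
testFrame1$obs

# The counter becomes the suffix of the widened columns
wide <- reshape(testFrame1, idvar = c("tripID", "name"),
                timevar = "obs", direction = "wide")
wide
```

Trips with fewer rows than the largest trip simply get NA in the extra columns.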
Great reproducible post considering it's your first. Here's a way to do it with dplyr and tidyr -
library(dplyr)
library(tidyr)
testFrame1 %>%
group_by(tripID, name) %>%
summarise(
SPP = toString(SPP),
kept = toString(kept)
) %>%
ungroup() %>%
separate("SPP", into = c("SPP", "SPP.1"), sep = ", ", extra = "drop", fill = "right") %>%
separate("kept", into = c("kept", "kept.1"), sep = ", ", extra = "drop", fill = "right")
# A tibble: 5 x 6
tripID name SPP SPP.1 kept kept.1
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1.00 SS Anne 101 201 12 22
2 2.00 HMS Endurance 101 201 14 24
3 3.00 Salty Hippo 102 <NA> 16 <NA>
4 4.00 Seagallop 102 <NA> 18 <NA>
5 5.00 Borealis 103 <NA> 10 <NA>

Transfer pivottable to another table in R

In my research I have a dataset of cancer patients with some clinical information like cancer stage and treatment etc. Each patient has one row in a table with this clinical information. In addition, each patient has, at one or several timepoints during the treatment, taken blood samples, depending on how long the patient has been followed at the clinic. The first sample is from the first visit and the second sample is from the second visit at the clinic, and so on.
In the table, there is a variable (ie. column) that is named Sample_Time_1, which is the time for the first sample. Sample_Time_2 has the time (date) for the second sample and so on.
However, the samples were analysed at the lab and I got the results in a pivot table, which means I have a table where each sample has one row, so the results from one patient are spread over several rows.
For example, create two tables:
x <- c(1,2,2,3,3,3,3,4,5,6,6,6,6,7,8,9,9,10)
y <- as.Date(c("2011-05-17","2012-06-30","2012-08-11","2011-10-15","2011-11-25","2012-01-07","2012-02-15","2011-08-13","2012-02-03","2011-11-08","2011-12-21","2012-02-01","2012-03-12","2012-01-03","2012-04-20","2012-03-31","2012-05-10","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
z <- c(123,185,153,153,125,148,168,187,194,115,165,167,143,151,129,130,151,134)
Sheet_1 <- matrix(c(x,y,z), ncol=3, byrow=FALSE)
colnames(Sheet_1) <- c("ID","Sample_Time", "Sample_Value")
Sheet_1 <- as.data.frame(Sheet_1)
Sheet_1$Sample_Time <- y
x1 <- c(1,2,3,4,5,6,7,8,9,10)
x2 <- c(3,3,2,3,2,2,4,2,3,3)
x3 <- c(1,2,2,3,3,1,3,1,1,2)
x4 <- as.Date(c("2011-05-17","2012-06-30","2011-10-15","2011-08-13","2012-02-03","2011-11-08","2012-01-03","2012-04-20","2012-03-31","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
x5 <- as.Date(c(NA,"2012-08-11","2011-11-25",NA,NA,"2011-12-21",NA,NA,"2012-05-10",NA), format="%Y-%m-%d", origin="1960-01-01")
x6 <- as.Date(c(NA,NA,"2012-01-07",NA,NA,"2012-02-01",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
x7 <- as.Date(c(NA,NA,"2012-02-15",NA,NA,"2012-03-12",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
Sheet_2 <- as.data.frame(c(1:10))
colnames(Sheet_2) <- "ID"
Sheet_2$Stage <- x2
Sheet_2$Treatment <- x3
Sheet_2$Sample_Time_1 <- x4
Sheet_2$Sample_Time_2 <- x5
Sheet_2$Sample_Time_3 <- x6
Sheet_2$Sample_Time_4 <- x7
Sheet_2$Sample_Value_1 <- NA
Sheet_2$Sample_Value_2 <- NA
Sheet_2$Sample_Value_3 <- NA
Sheet_2$Sample_Value_4 <- NA
I would like to transfer the Sample_Value for the first date a sample was taken from a patient from Sheet_1 to Sheet_2$Sample_Value_1 and if there are more samples, I would like to transfer them to column "Sample_Value_2" and so on.
I have tried a double for-loop. For each patient (=ID) in Sheet_1 I run through Sheet_2, and if there is a match on ID, I use another for-loop to check for a match on Sample_Time and insert (using if) the Sample_Value. However, I cannot get it to work, and I have a strong feeling there must be a better way.
Any suggestions?
Is this what you want:
Prepare Sheet_1 for reshaping from long to wide by introducing an extra column with unique ID for each blood sample per patient
Sheet_1$uniqid <- with(Sheet_1, ave(as.character(ID), ID, FUN = seq_along))
And with this, do the re-shaping
S_1 <- reshape( Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
which gives you
> S_1
ID Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2 Sample_Time.3
1 1 2011-05-17 123 <NA> NA <NA>
2 2 2012-06-30 185 2012-08-11 153 <NA>
4 3 2011-10-15 153 2011-11-25 125 2012-01-07
8 4 2011-08-13 187 <NA> NA <NA>
9 5 2012-02-03 194 <NA> NA <NA>
10 6 2011-11-08 115 2011-12-21 165 2012-02-01
14 7 2012-01-03 151 <NA> NA <NA>
15 8 2012-04-20 129 <NA> NA <NA>
16 9 2012-03-31 130 2012-05-10 151 <NA>
18 10 2011-12-15 134 <NA> NA <NA>
Sample_Value.3 Sample_Time.4 Sample_Value.4
1 NA <NA> NA
2 NA <NA> NA
4 148 2012-02-15 168
8 NA <NA> NA
9 NA <NA> NA
10 167 2012-03-12 143
14 NA <NA> NA
15 NA <NA> NA
16 NA <NA> NA
18 NA <NA> NA
The number after the dot in the colnames is the uniqid.
Now you can merge the relevant columns from Sheet_2
S_2 <- merge( Sheet_2[ 1:3 ], S_1, by = "ID" )
and the result should be what you are looking for:
> S_2
ID Stage Treatment Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2
1 1 3 1 2011-05-17 123 <NA> NA
2 2 3 2 2012-06-30 185 2012-08-11 153
3 3 2 2 2011-10-15 153 2011-11-25 125
4 4 3 3 2011-08-13 187 <NA> NA
5 5 2 3 2012-02-03 194 <NA> NA
6 6 2 1 2011-11-08 115 2011-12-21 165
7 7 4 3 2012-01-03 151 <NA> NA
8 8 2 1 2012-04-20 129 <NA> NA
9 9 3 1 2012-03-31 130 2012-05-10 151
10 10 3 2 2011-12-15 134 <NA> NA
Sample_Time.3 Sample_Value.3 Sample_Time.4 Sample_Value.4
1 <NA> NA <NA> NA
2 <NA> NA <NA> NA
3 2012-01-07 148 2012-02-15 168
4 <NA> NA <NA> NA
5 <NA> NA <NA> NA
6 2012-02-01 167 2012-03-12 143
7 <NA> NA <NA> NA
8 <NA> NA <NA> NA
9 <NA> NA <NA> NA
10 <NA> NA <NA> NA
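One thing to watch: numbering the samples with seq_along assumes the rows of Sheet_1 are already in date order within each patient, which happens to be true for the example data. If that is not guaranteed, sorting first is a cheap safeguard. A small sketch with a deliberately shuffled subset of the data (the shuffling is my own, only to show the effect):

```r
# Three patients from the example, with rows deliberately out of date order
Sheet_1 <- data.frame(
  ID = c(2, 1, 2, 3, 3),
  Sample_Time = as.Date(c("2012-08-11", "2011-05-17", "2012-06-30",
                          "2011-11-25", "2011-10-15")),
  Sample_Value = c(153, 123, 185, 125, 153)
)

# Sort by patient and date so that sample 1 really is the earliest sample
Sheet_1 <- Sheet_1[order(Sheet_1$ID, Sheet_1$Sample_Time), ]

# Number the samples within each patient, then reshape as before
Sheet_1$uniqid <- ave(Sheet_1$ID, Sheet_1$ID, FUN = seq_along)
S_1 <- reshape(Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
S_1
```

Without the order() step, patient 2 would get the 2012-08-11 sample in the first slot, which would not match Sample_Time_1 in Sheet_2.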
