Reducing multiple rows to 1 by index in R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 6 years ago.
I am relatively new to R. I am working with a dataset that has multiple data points per timestamp, but they are spread over multiple rows. I am trying to make a single row for each timestamp with a column for each variable.
Example dataset
Time Variable Value
10 Speed 10
10 Acc -2
10 Energy 10
15 Speed 9
15 Acc -1
20 Speed 9
20 Acc 0
20 Energy 2
I'd like to convert this to
Time Speed Acc Energy
10 10 -2 10
15 9 -1 (blank or N/A)
20 9 0 2
These are measured values so they are not always complete.
I have tried ddply to extract each individual value into an array and recombine, but the columns are different lengths. I have tried aggregate, but I can't figure out how to keep the variable and value linked. I know I could do this with a for loop type solution, but that seems a poor way to do this in R. Any advice or direction would help. Thanks!

Assuming the data frame is named df:
library(tidyr)
spread(df,Variable,Value)
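In current tidyr releases spread() is marked as superseded in favour of pivot_wider(). A minimal sketch of the same reshape, assuming the question's data typed in by hand as a plain data frame (the object names df and wide are only illustrative):

```r
library(tidyr)

# The question's example data, entered by hand
df <- data.frame(
  Time     = c(10, 10, 10, 15, 15, 20, 20, 20),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc",
               "Speed", "Acc", "Energy"),
  Value    = c(10, -2, 10, 9, -1, 9, 0, 2)
)

# One row per Time, one column per distinct Variable;
# combinations that were never measured are filled with NA
wide <- pivot_wider(df, names_from = Variable, values_from = Value)
print(wide)
```

The missing Energy reading at Time 15 comes out as NA, which matches the "(blank or N/A)" cell in the desired output.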

Typically a job for dcast in reshape2. First, we make your example reproducible:
df <- structure(list(Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
Variable = structure(c(3L, 1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("Acc",
"Energy", "Speed"), class = "factor"), Value = c(10L, -2L, 10L,
9L, -1L, 9L, 0L, 2L)), .Names = c("Time", "Variable", "Value"),
class = "data.frame", row.names = c(NA, -8L))
Then:
library(reshape2)
dcast(df, Time ~ ...)
Time Acc Energy Speed
10 -2 10 10
15 -1 NA 9
20 0 2 9
With dplyr you can reorder the columns (a purely cosmetic step) with:
library(dplyr)
dcast(df, Time ~ ...) %>% select(Time, Speed, Acc, Energy)
Time Speed Acc Energy
10 10 -2 10
15 9 -1 NA
20 9 0 2
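For readers who would rather avoid extra packages, the same long-to-wide step can probably be done with reshape() from base R. This is only a sketch under the assumption that each Time/Variable pair occurs at most once; the data are retyped from the question and the name wide is illustrative:

```r
# The question's example data, entered by hand
df <- data.frame(
  Time     = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc",
               "Speed", "Acc", "Energy"),
  Value    = c(10L, -2L, 10L, 9L, -1L, 9L, 0L, 2L),
  stringsAsFactors = FALSE
)

# idvar identifies the output rows, timevar supplies the new column names
wide <- reshape(df, idvar = "Time", timevar = "Variable", direction = "wide")

# reshape() names the new columns Value.Speed, Value.Acc, ...; drop the prefix
names(wide) <- sub("^Value\\.", "", names(wide))
print(wide)
```

The new columns appear in the order the variables first occur in the data (Speed, Acc, Energy), so no reordering step is needed here.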

Related

Pivoting data frame in R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 9 months ago.
I have a data frame that looks like the following:
Day Minutes Status
1 0 Play
1 10 Eat
1 30 Move
1 50 Transport
2 0 Play
2 20 Transport
2 50 Sleep
Is it possible to pivot the table to have my Day as an index while the column names are the status and the values are the minutes?
Desired Output:
Day Play Eat Move Transport Play Transport Sleep
1
2
You can use pivot_wider from tidyr (part of the tidyverse). You can supply the new column names using names_from, then you want to fill in the values with the data from Minutes.
library(tidyverse)
df %>%
  pivot_wider(names_from = "Status", values_from = "Minutes")
Output
Day Play Eat Move Transport Sleep
<int> <int> <int> <int> <int> <int>
1 1 0 10 30 50 NA
2 2 0 NA NA 20 50
Data
df <- structure(list(Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Minutes = c(0L,
10L, 30L, 50L, 0L, 20L, 50L), Status = c("Play", "Eat", "Move",
"Transport", "Play", "Transport", "Sleep")), class = "data.frame", row.names = c(NA,
-7L))
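As a rough base R cross-check of the pivot above, tapply() can build the same Day by Status grid. This is a sketch that assumes each Day/Status pair occurs at most once (the helper function simply takes the first value); note that the columns come out in alphabetical order rather than in order of appearance:

```r
df <- data.frame(
  Day     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Minutes = c(0L, 10L, 30L, 50L, 0L, 20L, 50L),
  Status  = c("Play", "Eat", "Move", "Transport",
              "Play", "Transport", "Sleep"),
  stringsAsFactors = FALSE
)

# Rows are days, columns are statuses; empty cells become NA
grid <- with(df, tapply(Minutes, list(Day = Day, Status = Status),
                        FUN = function(x) x[1]))
print(grid)

# Optional: turn the matrix into a data frame with Day as a regular column
wide <- data.frame(Day = as.integer(rownames(grid)), grid, row.names = NULL)
print(wide)
```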

group two variables(in rows) in R to create one variable [duplicate]

This question already has answers here:
How to merge multiple rows by a given condition and sum?
(2 answers)
Closed 2 years ago.
I have a data frame where
Disease Genemutation Mean. Total No of pateints No.of pateints.
cancertype1 BRCA1 1 10 2
cancertype2 BRCA2 5 10 3
cancertype3 BRCA2 7 10 4
cancertype1 BRCA1 8 10 1
cancertype3 BRCA2 4 10 4
cancertype2 BRCA1 6 10 1
How do I create a new category called cancertype4 (from cancertype3 and cancertype2) that includes the number of patients that have it as a result of merging the two?
We can use replace with %in% to replace those values (assuming 'Disease' is character class)
library(dplyr)
# Assumes a data frame df1 whose total column is named TotalNoofpateints
df1 %>%
  group_by(Disease = replace(Disease,
      Disease %in% c("cancertype2", "cancertype3"), "cancertype4")) %>%
  summarise(TotalNoofpateints = sum(TotalNoofpateints))
Output
# A tibble: 2 x 2
# Disease TotalNoofpateints
# <chr> <int>
#1 cancertype1 20
#2 cancertype4 40
Here is a base R option using aggregate
aggregate(
Total.No.of.pateints ~ Disease,
transform(
df,
Disease = replace(Disease, Disease %in% c("cancertype2", "cancertype3"), "cancertype4")
),
sum
)
giving
Disease Total.No.of.pateints
1 cancertype1 20
2 cancertype4 40
Data
> dput(df)
structure(list(Disease = c("cancertype1", "cancertype2", "cancertype3",
"cancertype1", "cancertype3", "cancertype2"), Genemutation = c("BRCA1",
"BRCA2", "BRCA2", "BRCA1", "BRCA2", "BRCA1"), Mean. = c(1L, 5L,
7L, 8L, 4L, 6L), Total.No.of.pateints = c(10L, 10L, 10L, 10L,
10L, 10L), No.of.pateints. = c(2L, 3L, 4L, 1L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
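Both answers sum only the total column. If the per-row patient counts should be merged as well, one possible base R sketch is to recode the disease first and then aggregate both columns at once. The column names follow the dput output above; whether summing Total.No.of.pateints is meaningful for this data is an assumption carried over from the answers:

```r
df <- data.frame(
  Disease = c("cancertype1", "cancertype2", "cancertype3",
              "cancertype1", "cancertype3", "cancertype2"),
  Total.No.of.pateints = c(10L, 10L, 10L, 10L, 10L, 10L),
  No.of.pateints.      = c(2L, 3L, 4L, 1L, 4L, 1L),
  stringsAsFactors = FALSE
)

# Recode cancertype2 and cancertype3 into the merged category
df$Disease[df$Disease %in% c("cancertype2", "cancertype3")] <- "cancertype4"

# Sum both count columns within each (recoded) disease
res <- aggregate(cbind(Total.No.of.pateints, No.of.pateints.) ~ Disease,
                 data = df, FUN = sum)
print(res)
```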

Two histograms with two variables in Ggplot2

This is my DF :
> head(xgb_1_plot)
week PRICE id_item food_cat_id test_label xgb_1
2 5 18 60 7 2 2
7 5 21 9 6 5 8
12 5 14 31 4 4 6
21 5 15 25 7 12 12
31 5 14 76 3 4 2
36 5 7 48 8 2 4
Where test_label is the test value, "xgb_1" is the column with the predicted values and id_items are the items.
I want to plot graph in which I can see predicted values VS true values side by side for some id_items.
There are over 100, so I need just a subset for the plot (otherwise it'll be a mess).
Let me know!
P.S. The best thing would be to turn test_label and xgb_1 into rows and add a dummy variable "Predicted/True value", but I have no idea how to do it.
I would suggest this approach, reshaping data and then plotting. Having more data, it will look better:
library(tidyverse)
#Data
dfa <- structure(list(id_item = c(60L, 9L, 31L, 25L, 76L, 48L), test_label = c(2L,
5L, 4L, 12L, 4L, 2L), xgb_1 = c(2L, 8L, 6L, 12L, 2L, 4L)), class = "data.frame", row.names = c("2",
"7", "12", "21", "31", "36"))
Code:
#Reshape
dfa %>% pivot_longer(cols = -id_item) %>%
ggplot(aes(x=value,fill=name))+
geom_histogram(position = position_dodge())+
facet_wrap(.~id_item)
Output: (figure not preserved)
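The reshaping step asked about in the P.S. can also be sketched in base R, together with the subsetting the question mentions. The three item ids in keep are an arbitrary, hypothetical choice, and the labels "True value" and "Predicted" are illustrative:

```r
dfa <- data.frame(
  id_item    = c(60L, 9L, 31L, 25L, 76L, 48L),
  test_label = c(2L, 5L, 4L, 12L, 4L, 2L),
  xgb_1      = c(2L, 8L, 6L, 12L, 2L, 4L)
)

# Hypothetical subset of items to show
keep <- c(60, 9, 31)
sub  <- dfa[dfa$id_item %in% keep, ]

# Stack the two value columns and label where each value came from
long <- data.frame(
  id_item = rep(sub$id_item, times = 2),
  type    = rep(c("True value", "Predicted"), each = nrow(sub)),
  value   = c(sub$test_label, sub$xgb_1)
)
print(long)
```

With the data in this shape, a dodged bar chart such as ggplot(long, aes(factor(id_item), value, fill = type)) + geom_col(position = "dodge") would put the true and predicted values side by side for each item.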
Here's a different approach using geom_errorbar. Maybe the color thing is a little bit too much, but today is a rainy day ... so I was in need of some variety
"%>%" <- magrittr::"%>%"
dat <- dplyr::tibble(id_item=c(60,9,31,25,76,48),
                     test_label=c(2,5,4,12,4,2),
                     xgb_1=c(2,8,6,12,2,4))
dat %>%
dplyr::mutate(diff=abs(test_label-xgb_1)) %>%
ggplot2::ggplot(ggplot2::aes(x=id_item,ymin=test_label,ymax=xgb_1,color=diff)) +
ggplot2::geom_errorbar()

Removing rows when some values match and some do not [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
ID Amount Previous
1 10 15
1 10 13
2 20 18
2 20 24
3 5 7
3 5 6
I want to remove the duplicate rows from the data frame above, where ID and Amount match but the values in the Previous column do not. When deciding which row to keep, I'd like to take the one where the Previous value is higher.
This would look like:
ID Amount Previous
1 10 15
2 20 24
3 5 7
One option is distinct on the columns 'ID' and 'Amount' (after arranging the dataset), specifying .keep_all = TRUE to keep all the other columns that correspond to the distinct elements in those columns
library(dplyr)
df1 %>%
arrange(ID, Amount, desc(Previous)) %>%
distinct(ID, Amount, .keep_all = TRUE)
# ID Amount Previous
#1 1 10 15
#2 2 20 24
#3 3 5 7
Or with duplicated from base R applied on the 'ID', 'Amount' to create a logical vector and use that to subset the rows of the dataset
df2 <- df1[with(df1, order(ID, Amount, -Previous)),]
df2[!duplicated(df2[c('ID', 'Amount')]),]
# ID Amount Previous
#1 1 10 15
#3 2 20 24
#5 3 5 7
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Amount = c(10L,
10L, 20L, 20L, 5L, 5L), Previous = c(15L, 13L, 18L, 24L, 7L,
6L)), class = "data.frame", row.names = c(NA, -6L))
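When Previous is the only extra column, as in this example, a shorter base R route is to take the group-wise maximum directly. This is a sketch only: unlike the two approaches above, aggregate() would not carry along any further columns from the winning row.

```r
df1 <- data.frame(
  ID       = c(1L, 1L, 2L, 2L, 3L, 3L),
  Amount   = c(10L, 10L, 20L, 20L, 5L, 5L),
  Previous = c(15L, 13L, 18L, 24L, 7L, 6L)
)

# Largest Previous within each ID/Amount combination
res <- aggregate(Previous ~ ID + Amount, data = df1, FUN = max)

# aggregate() does not keep the input order, so sort by ID explicitly
res <- res[order(res$ID), ]
print(res)
```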

Can reshape in base R turn more than one time var into a single column in long format?

I can reshape the part of my columns having the same 'name stem' opg.1 through opg.10, but when I present the last two 'time' variables, mkd.1 and mkd.2, I get the following error:
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
'varying' arguments must be the same length
In short, my question is, will renaming mkd.1 and mkd.2 to have the same opg name stem remove the error, and in case it will, why?
My code is
gdata <- termin.test[sel.cols]
names(gdata) <- c( "opg.1", "opg.2","opg.3","opg.4","opg.5", "opg.61",
"opg.62","opg.7","opg.8","opg.9","opg.10",
"navn",
"mkd.11","mkd.12" )
head(gdata)
# opg.1 opg.2 opg.3 opg.4 opg.5 opg.61 opg.62 opg.7 opg.8 opg.9 opg.10
# 1 2 2 0 0 1 0 10 4 5 10 3
# 2 0 1 0 0 2 2 5 5 2 8 1
# 3 1 0 0 0 0 0 7 3 3 7 4
# 4 0 0 0 0 0 2 7 4 8 10 7
# 5 8 2 3 4 7 3 11 12 10 8 16
# 6 1 2 1 1 2 2 5 2 2 3 6
# navn mkd.11 mkd.12
# 1 Czzzzzzz 5 24
# 2 Xxxxxx A 2 16
# 3 Cccccc B 1 17
# 4 Christian 0 26
# 5 Emil Xxxx 16 33
# 6 Aaaaa-Sss 4 11
So far so good. But here, my varying= parameter turns me down.
I wanted the variables opg.1-opg.10 and the final two mkd.11 and mkd.12.
redata <- reshape(
# The first 11 columns are per-task tick counts + no. 12: the student's name
gdata, # [,1:12],
direction = "long",
varying=c(1:11,13,14), # Works problem free with varying = 1:11
timevar = "opgave", #
# Vector OPGAVER is defined with task names above ????
times = opgaver
)
I have a hypothesis that it will work to rename mkd.11 -> opg.11. But I post the question because I would like to (1) get into base R and (2) comprehend what I am doing. I looked up the question What code does a task like the reshape2 package in a base reshape function? but found neither a matching problem nor answers relevant to my question.
Edit
Rephrasing the question as I need a single numerical column in the long format reshaped data frame.
The reshape function needs the names in the "varying" argument to be balanced and consistent: it splits each name at the dot into a stem and a time, and every stem must cover the same number of times. Here opg has 11 items and mkd has only 2.
"I need a single numerical column in the long format reshaped data frame."
Then rename the two mkd variables to opg.11 and opg.12 before reshaping (as you did).
names(gdata)[13:14] <- c("opg.11","opg.12")
reshape(gdata,
direction = "long",
varying=c(1:11,13,14),
timevar = "opgave"
) # we don't have your `opgaver` object
navn opgave opg id
1.1 Czzzzzzz 1 2 1
2.1 Xxxxxx A 1 0 2
3.1 Cccccc B 1 1 3
4.1 Christian 1 0 4
5.1 Emil Xxxx 1 8 5
6.1 Aaaaa-Sss 1 1 6
...
1.12 Czzzzzzz 12 24 1
2.12 Xxxxxx A 12 16 2
3.12 Cccccc B 12 17 3
4.12 Christian 12 26 4
5.12 Emil Xxxx 12 33 5
6.12 Aaaaa-Sss 12 11 6
If your output is a boxplot, then modify the labels in the command to draw it, or you can convert the opgave variable into a factor with the appropriate labels.
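To see why mixed name stems trigger the error while a single stem does not, here is a small sketch on toy data. The values and the data frame toy are hypothetical, and the error text is captured with tryCatch so the script keeps running:

```r
# Toy data: two "opg" columns and one "mkd" column
toy <- data.frame(navn = c("A", "B"),
                  opg.1 = c(2L, 0L), opg.2 = c(2L, 1L),
                  mkd.1 = c(5L, 2L),
                  stringsAsFactors = FALSE)

# Mixed stems: reshape() groups opg.1/opg.2 under "opg" and mkd.1 under "mkd",
# so the two groups have different lengths and the call stops with an error
msg <- tryCatch(
  {
    reshape(toy, direction = "long", varying = 2:4, idvar = "navn")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# One stem: all three columns now feed a single "opg" column
names(toy)[4] <- "opg.3"
long <- reshape(toy, direction = "long", varying = 2:4,
                timevar = "opgave", idvar = "navn")
print(long)
```

So the rename works not because of the name itself but because all varying columns then belong to one group, which is what a single numerical column in the long format requires.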
I can see that when I use the suggestion of @akrun, I get two columns, opg and mkd, in the reshaped data frame. As there are 11 opg columns and only 2 mkd columns, the reason for the mentioned error message is evident. In my data set, I end up with
> melt(setDT(gdata), measure = patterns('^opg\\.\\d+$', '^mkd\\.\\d+$'),
+ value.name = c('opg', 'mkg'), variable.name = 'opgave')
# navn opgave opg mkg
# 1: Czzzzzzz 1 2 5
# 2: Caroline Cxxxx 1 0 2
# 3: Crrrrrrr Rrrrr 1 1 1
# 4: Christian 1 0 0
# 5: Emil Zzzz Cccc 1 8 16
# ---
#238: Owiler 11 8 NA
#239: Sarah 11 5 NA
#240: Bang Bang 11 10 NA
#241: Thhhhh 11 2 NA
#242: William B 11 6 NA
The NA values in the mkg column show that there are fewer variables of this type. This is not as intended. Therefore I stick to the same-name-stem option:
gdata <- termin.test[sel.cols]
names(gdata) <- c( "opg.1", "opg.2","opg.3","opg.4","opg.5", "opg.61",
"opg.62","opg.7","opg.8","opg.9","opg.10",
"navn",
"opg.11","opg.12" )
redata <- reshape(
# The first 11 columns are per-task tick counts + no. 12: the student's name
gdata, # [,1:12],
direction = "long",
varying=c(1:11,13,14), # The first 11 columns are to be "flipped"
timevar = "opgave", #
# Vector OPGAVER is defined with task names above ????
times = opgaver
)
This solution works in my further processing for the boxplot diagram drawn with geom_boxplot(). I can live with the names of the two latter columns; renaming them in the factor variable opgave is beyond the scope of this question.
If we want to rename the 'mkd' columns to 'opg':
library(ggplot2)
library(stringr)
library(dplyr)
library(tidyr)
gdata %>%
  rename_at(vars(starts_with('mkd')), ~ str_replace(., 'mkd', 'opg')) %>%
  pivot_longer(cols = -navn, names_to = 'opgave', values_to = 'value') %>%
  ggplot(aes(x = opgave, y = value)) +
  geom_boxplot()
data
gdata <- structure(list(opg.1 = c(2L, 0L, 1L, 0L, 8L, 1L), opg.2 = c(2L,
1L, 0L, 0L, 2L, 2L), opg.3 = c(0L, 0L, 0L, 0L, 3L, 1L), opg.4 = c(0L,
0L, 0L, 0L, 4L, 1L), opg.5 = c(1L, 2L, 0L, 0L, 7L, 2L), opg.61 = c(0L,
2L, 0L, 2L, 3L, 2L), opg.62 = c(10L, 5L, 7L, 7L, 11L, 5L), opg.7 = c(4L,
5L, 3L, 4L, 12L, 2L), opg.8 = c(5L, 2L, 3L, 8L, 10L, 2L), opg.9 = c(10L,
8L, 7L, 10L, 8L, 3L), opg.10 = c(3L, 1L, 4L, 7L, 16L, 6L), navn = c("Czzzzzzz",
"Xxxxxx A", "Cccccc B", "Christian", "Emil Xxxx", "Aaaaa-Sss"
), mkd.11 = c(5L, 2L, 1L, 0L, 16L, 4L), mkd.12 = c(24L, 16L,
17L, 26L, 33L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
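A base R sketch of the same rename-then-reshape idea, which also relabels the time variable as a factor so the former mkd columns stay recognisable on a plot axis. The data are a cut-down, hypothetical version of gdata, and the labels are illustrative:

```r
# Cut-down stand-in for gdata: two opg columns, the name, two mkd columns
gdata <- data.frame(opg.1 = c(2L, 0L), opg.2 = c(2L, 1L),
                    navn = c("A", "B"),
                    mkd.11 = c(5L, 2L), mkd.12 = c(24L, 16L),
                    stringsAsFactors = FALSE)

# Give every measured column the same stem
names(gdata) <- sub("^mkd\\.", "opg.", names(gdata))

# All opg.* columns are stacked into one numeric column called opg
long <- reshape(gdata, direction = "long",
                varying = grep("^opg\\.", names(gdata)),
                timevar = "opgave", idvar = "navn")

# Relabel the time variable so the origin of each column is still visible
long$opgave <- factor(long$opgave,
                      levels = c(1, 2, 11, 12),
                      labels = c("opg 1", "opg 2", "mkd 11", "mkd 12"))
print(long)
```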
