R reshape wide to long data [closed] - r

I have a dataframe like so:
[1] "drugevent" "prr" "prr_lowerCI" "prr_upperCI" "EBGM"
[6] "EBG_lowerCI" "EBGM_upperCI" "strata.coded" "strata" "Reference"
And I want to make a plot for each drugevent, using ggplot. In order to do so I need to format my DF like so:
[1] "drug", "event", "measurement"(prr or EBGM), "lowerCI"(for corresponding measurement), upperCI, strata
But despite the many posts on SO and the R tutorials, I was not able to correctly reshape the data. In my last try I added an ID like so:
mutate(DF, count=1:n())
melted the data
melt(DF, id.vars="count")
then I made several DFs subsetting the values of interest
subset(melted, variable %in% c("prr","EBGM"))
then the upper and lower confidence intervals, strata and drug event,
but when I merged them like so:
merge(measurement, lowerCI, by="count")
in the end I had duplicated values with 4 rows for each count.
The code is messy and the result is wrong. Could you please help me with this?
Edit: examples
initial data:
drugevent prr prr_lowerCI prr_upperCI
1 CLARITHROMYCIN-Erythema Multiforme 1.3539930 0.1903270 2.517659
2 CLARITHROMYCIN-Erythema Multiforme 1.7741342 0.6647390 2.883529
EBGM EBG_lowerCI EBGM_upperCI strata count
1 0.9003325 0.2128934 2.772558 Infants 1
2 1.4471096 0.5997188 3.053965 Children 2
the desired result:
measurement value upperCI strata drug
1 prr 1.353992979 2.51765895 Infants CLARITHROMYCIN
2 EBGM 0.9009 2.77 Infants CLARITHROMYCIN
reaction lowerCI
1 Erythema Multiforme 2.51765895
2 Erythema Multiforme 1.447

From what I understand, you want a long format of the original data frame, with one block of rows per measurement (prr or EBGM):
dfPRR <- cbind(df[, !grepl("EBG", colnames(df))], measurement="prr")
colnames(dfPRR)[2:4] <- c("value", "lowerCI", "upperCI")
dfEBGM <- cbind(df[, !grepl("prr", colnames(df))], measurement="EBGM")
colnames(dfEBGM)[2:4] <- c("value", "lowerCI", "upperCI")
rbind(dfPRR, dfEBGM)
Data used
structure(list(drugevent = structure(c(1L, 1L), .Label = "CLARITHROMYCIN-Erythema Multiforme", class = "factor"),
prr = c(1.353993, 1.7741342), prr_lowerCI = c(0.190327, 0.664739
), prr_upperCI = c(2.517659, 2.883529), EBGM = c(0.9003325,
1.4471096), EBG_lowerCI = c(0.2128934, 0.5997188), EBGM_upperCI = c(2.772558,
3.053965), strata = structure(1:2, .Label = c(" Infants",
" Children"), class = "factor"), count = 1:2), .Names = c("drugevent",
"prr", "prr_lowerCI", "prr_upperCI", "EBGM", "EBG_lowerCI", "EBGM_upperCI",
"strata", "count"), class = "data.frame", row.names = c(NA, -2L
))
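The desired result in the question also splits drugevent into separate drug and reaction columns, which the rbind above does not do. A minimal sketch of that step in base R, assuming the drug name itself never contains a hyphen (the data frame `long` below is a hypothetical stand-in for the result of rbind(dfPRR, dfEBGM)):

```r
# Hypothetical stand-in for the long-format result of rbind(dfPRR, dfEBGM)
long <- data.frame(
  drugevent = c("CLARITHROMYCIN-Erythema Multiforme",
                "CLARITHROMYCIN-Erythema Multiforme"),
  value = c(1.353993, 0.9003325),
  measurement = c("prr", "EBGM"),
  stringsAsFactors = FALSE
)

# Everything before the first hyphen is taken as the drug ...
long$drug <- sub("-.*$", "", long$drugevent)
# ... and everything after the first hyphen as the reaction
long$reaction <- sub("^[^-]*-", "", long$drugevent)

long[, c("drug", "reaction", "measurement", "value")]
```

If some drug names do contain hyphens, the split rule would need to change (for example, splitting at the last hyphen instead).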


Are there any functions to subtract all features(rows) to a particular value(row) in the same data file?

I am new to programming in R and Python but know some basics. I have a question about a computation: is there a function that subtracts the value of one particular row (the normalizer) from every feature (row) in the same data set? I would like to obtain Output_value1 as shown in the file linked below, and then multiply it by -1 to obtain Output_value2.
data file link: https://www.dropbox.com/s/m5rsi6ru419f5bf/Template_matrixfile.xlsx?dl=0
Please let me know if you need more details.
I have tried performing the same operation in MS Excel, but it is very tedious and time-consuming.
I have many large datasets with several hundred rows and columns, which makes doing this manually in MS Excel impractical. Hence, I would prefer to write code to obtain the desired outputs.
Here is the example data. The inputs are the Feature and Value columns, and the outputs are the Output_value1 and Output_value2 columns.
| Feature | Value | Output_value1 | Output_value2 |
| Gene_1 | 14.25633934 | 0.80100922 | -0.80100922 |
| Gene_2 | 16.88394578 | 3.42861566 | -3.42861566 |
| Gene_3 | 16.01 | 2.55466988 | -2.55466988 |
| Gene_4 | 13.82329514 | 0.36796502 | -0.36796502 |
| Gene_5 | 12.96382949 | -0.49150063 | 0.49150063 |
| Normalizer | 13.45533012 | 0 | 0 |
dput(head(Exampledata))
structure(list(Feature = structure(1:6, .Label = c("Gene_1", "Gene_2",
"Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"), Value =
c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012), Output_value1 = c(0.80100922, 3.42861566, 2.55466988,
0.36796502, -0.49150063, 0), Output_value2 = c(-0.80100922,
-3.42861566, -2.55466988, -0.36796502, 0.49150063, 0)), row.names = c(NA, 6L), class = "data.frame")
Assuming you'll only have one row where Feature == "Normalizer", in R you can take the Value of that row and subtract it from every row.
Exampledata$Output_value1 <- Exampledata$Value -
Exampledata$Value[Exampledata$Feature == "Normalizer"]
Exampledata$Output_value2 <- Exampledata$Output_value1 * -1
Exampledata
# Feature Value Output_value1 Output_value2
#1 Gene_1 14.25634 0.8010092 -0.8010092
#2 Gene_2 16.88395 3.4286157 -3.4286157
#3 Gene_3 16.01000 2.5546699 -2.5546699
#4 Gene_4 13.82330 0.3679650 -0.3679650
#5 Gene_5 12.96383 -0.4915006 0.4915006
#6 Normalizer 13.45533 0.0000000 0.0000000
EDIT
For multiple such columns, we can do
cols <- grep("^Value", names(data))
inds <- which(data$Feature == "Normalizer")
data[paste0("Output", seq_along(cols))] <- data[cols] - data[rep(inds, nrow(data)),cols]
data[paste0("Output_inverted", seq_along(cols))] <- data[grep("Output", names(data))] * -1
data
Exampledata <- structure(list(Feature = structure(1:6, .Label = c("Gene_1",
"Gene_2", "Gene_3", "Gene_4", "Gene_5", "Normalizer"), class = "factor"),
Value = c(14.25633934, 16.88394578, 16.01, 13.82329514, 12.96382949,
13.45533012)), row.names = c(NA, 6L), class = "data.frame")
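As a hedged alternative for the multi-column case, the same subtraction can be sketched with base R's sweep(), which subtracts one statistic per column. The column names Value1 and Value2 and the numbers below are made up for illustration; only the Feature == "Normalizer" convention comes from the question.

```r
# Hypothetical data with two value columns (names and numbers are invented)
data <- data.frame(
  Feature = c("Gene_1", "Gene_2", "Normalizer"),
  Value1 = c(14, 16, 13),
  Value2 = c(10, 9, 8)
)

cols <- grep("^Value", names(data))

# One normalizer value per value column, taken from the Normalizer row
norm_row <- unlist(data[data$Feature == "Normalizer", cols])

# Subtract each column's normalizer value from that whole column
out <- sweep(as.matrix(data[cols]), 2, norm_row, "-")
colnames(out) <- paste0("Output", seq_along(cols))

data <- cbind(data, out)
data
```

This assumes exactly one Normalizer row; with several, norm_row would no longer hold one value per column.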

R data table - extract value from alternative columns based on value from another column [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I have a data table in R where each row represent a visit of a user in a social media platform. For simplicity, an example of this data is as follows:
UserID Channel TW_VisitDuration TW_Activity FB_VisitDuration FB_Activity
aaa TW 30 High
bbb FB 45 Low
Each visit has a channel (e.g. FB/TW) and the other columns are filled according to this channel (only relevant columns are filled).
I want to have a new table where each group of similar columns is reduced to a single column, with the value taken from the column that matches the row's channel. In this case, the new table would look like this:
UserID Channel VisitDuration Activity
aaa TW 30 High
bbb FB 45 Low
I wrote a for loop which does this evaluation row by row, but I am sure this is not "the R way to do this" (and the performance of the loop would probably be bad as my data will scale).
This is the for loop I wrote:
for (i in 1:nrow(res.table)){
cur.channel = res.table[,Channel][i]
for (field in specific.fields){
print(field)
test.t[[field]][i] = res.table[[paste(cur.channel,field,sep='_')]][i]
}
}
How can I do it without the need to go row by row?
We can use melt from data.table to convert this to 'long' format. The measure argument can take multiple patterns, so both groups of columns are melted in a single call.
library(data.table)
melt(setDT(df1), measure = patterns("Visit", "Activity"),
value.name = c("VisitDuration", "Activity"), na.rm = TRUE)[, variable := NULL][]
# UserID Channel VisitDuration Activity
#1: aaa TW 30 High
#2: bbb FB 45 Low
data
df1 <- structure(list(UserID = c("aaa", "bbb"), Channel = c("TW", "FB"
), TW_VisitDuration = c(30L, NA), TW_Activity = c("High", NA),
FB_VisitDuration = c(NA, 45L), FB_Activity = c(NA, "Low")), .Names = c("UserID",
"Channel", "TW_VisitDuration", "TW_Activity", "FB_VisitDuration",
"FB_Activity"), class = "data.frame", row.names = c(NA, -2L))
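If data.table is not an option, a vectorized base R sketch is to coalesce the channel-specific columns, taking the first non-NA value in each row. This relies on the statement in the question that only the relevant channel's columns are filled; the helper name coalesce_cols is made up for this example.

```r
df1 <- data.frame(
  UserID = c("aaa", "bbb"),
  Channel = c("TW", "FB"),
  TW_VisitDuration = c(30L, NA),
  TW_Activity = c("High", NA),
  FB_VisitDuration = c(NA, 45L),
  FB_Activity = c(NA, "Low"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for all columns ending in "_<suffix>",
# keep the first non-NA value in each row
coalesce_cols <- function(df, suffix) {
  cols <- grep(paste0("_", suffix, "$"), names(df), value = TRUE)
  Reduce(function(a, b) ifelse(is.na(a), b, a), df[cols])
}

res <- data.frame(
  df1[c("UserID", "Channel")],
  VisitDuration = coalesce_cols(df1, "VisitDuration"),
  Activity = coalesce_cols(df1, "Activity"),
  stringsAsFactors = FALSE
)
res
```

Unlike the row-by-row loop, this never looks at Channel at all, so it would silently pick the wrong value if more than one channel's columns were filled in the same row.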

Regrouping categorical values in a dataframe [closed]

I'm looking for a suggestion: I'm trying to re-order/group a data frame by a variable value.
For example transforming a native data frame VARS
into something like this:
So far, I've tried for-loops with cbind/rbind depending on how the data is organized, aggregate, apply, etc. But there's always some wrinkle that prevents the methods from working.
I appreciate any help!
First, I'd like to point out that reading up on how to give a useful example, along with posting the raw data using dput, will go a long way toward getting feedback. That said:
For the dataset you showed:
A <- structure(list(Var_Typer = c("cnt", "Cont", "cnt", "cnt", "fact",
"fact", "Char", "Char", "Cont"), R_FIELD = c("Gender", "Age",
"Activation", "WakeUpStroke", "ArMode", "PreHospActiv", "EMTag",
"EMTdx", "EMTlams")), .Names = c("Var_Typer", "R_FIELD"), row.names = c(NA,
-9L), class = "data.frame")
> head(A)
Var_Typer R_FIELD
1 cnt Gender
2 Cont Age
3 cnt Activation
4 cnt WakeUpStroke
5 fact ArMode
6 fact PreHospActiv
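library(reshape2)  # dcast
library(magrittr)  # %>%
library(jsonlite)  # rbind.pages (newer jsonlite releases name it rbind_pages)
B <- apply(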
dcast(A, Var_Typer ~ R_FIELD, value.var = 'R_FIELD'), 1, function(i){
ndf <- as.data.frame(rbind(i[complete.cases(i)]))
colnames(ndf) <- c('Class',1:(length(ndf)-1))
ndf
}) %>% rbind.pages %>% (function(x){
x[is.na(x)] <- "..."
x
})
Class 1 2 3
1 Char EMTag EMTdx ...
2 cnt Activation Gender WakeUpStroke
3 Cont Age EMTlams ...
4 fact ArMode PreHospActiv ...
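A similar table can be sketched in base R alone by splitting R_FIELD by Var_Typer and padding the shorter groups with "...". This is an alternative under assumptions, not the code above: fields keep their original order within each class rather than being sorted, and the value columns come out named X1, X2, X3.

```r
A <- data.frame(
  Var_Typer = c("cnt", "Cont", "cnt", "cnt", "fact",
                "fact", "Char", "Char", "Cont"),
  R_FIELD = c("Gender", "Age", "Activation", "WakeUpStroke", "ArMode",
              "PreHospActiv", "EMTag", "EMTdx", "EMTlams"),
  stringsAsFactors = FALSE
)

# One character vector of field names per class
groups <- split(A$R_FIELD, A$Var_Typer)

# Pad every group to the length of the longest one
width <- max(lengths(groups))
padded <- t(sapply(groups, function(x) c(x, rep("...", width - length(x)))))

B <- data.frame(Class = rownames(padded), padded,
                row.names = NULL, stringsAsFactors = FALSE)
B
```

The row order of B follows the sort order of the class names, which can differ between locales.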

R: Multiple bar plots of mean value vs. month vs. genre [closed]

I have the following data-frame, where variable is 10 different genre categories of movies, eg. drama, comedy etc.
> head(grossGenreMonthLong)
Gross ReleasedMonth variable value
5 33508485 2 drama 1
6 67192859 2 drama 1
8 37865 4 drama 1
9 76665507 1 drama 1
10 221594911 2 drama 1
12 446438 2 drama 1
Reproducible dataframe:
dput(head(grossGenreMonthLong))
structure(list(Gross = c(33508485, 67192859, 37865, 76665507,
221594911, 446438), ReleasedMonth = c(2, 2, 4, 1, 2, 2), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("drama", "comedy", "short", "romance",
"action", "crime", "thriller", "documentary", "adventure", "animation"
), class = "factor"), value = c(1, 1, 1, 1, 1, 1)), .Names = c("Gross",
"ReleasedMonth", "variable", "value"), row.names = c(5L, 6L,
8L, 9L, 10L, 12L), class = "data.frame")
I would like to calculate the mean gross vs. month for each of the 10 genres and plot them in separate bar charts using facets (varying by genre).
In other words, what's a quick way to plot 10 bar charts of mean gross vs. month for each of the 10 genres?
You should provide a reproducible example to make it easier for us to help you. dput(my.dataframe) is one way to do it, or you can generate an example dataframe like below. Since you haven't given us a reproducible example, I'm going to put on my telepathy hat and assume the "variable" column in your screenshot is the genre.
n = 100
movies <- data.frame(
genre=sample(letters[1:10], n, replace=T),
gross=runif(n, min=1, max=1e7),
month=sample(12, n, replace=T)
)
head(movies)
# genre gross month
# 1 e 5545765.4 1
# 2 f 3240897.3 3
# 3 f 1438741.9 5
# 4 h 9101261.0 6
# 5 h 926170.8 7
# 6 f 2750921.9 1
(My genres are 'a', 'b', etc).
To do a plot of average gross per month, you will need to calculate average gross per month. One such way to do so is using the plyr package (there is also data.table, dplyr, ...)
library(plyr)
monthly.avg.gross <- ddply(movies, # the input dataframe
.(genre, month), # group by these
summarize, avgGross=mean(gross)) # do this.
The dataframe monthly.avg.gross now has one row per (month, genre) with a column avgGross that has the average gross in that (month, genre).
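If plyr is not available, the same grouped mean could be sketched in base R with aggregate(). The tiny movies data frame here is made up so the result is easy to check by hand; it is not the random data generated above.

```r
# Small made-up data set: two genres, two months
movies <- data.frame(
  genre = c("a", "a", "b", "b", "b"),
  month = c(1, 1, 1, 2, 2),
  gross = c(10, 30, 5, 7, 9)
)

# Mean of gross for every (genre, month) combination present in the data
monthly.avg.gross <- aggregate(gross ~ genre + month, data = movies, FUN = mean)
names(monthly.avg.gross)[3] <- "avgGross"
monthly.avg.gross
```

As with ddply, the result has one row per (genre, month) pair, so it can be passed to the same plotting code.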
Now it's a matter of plotting. You have hinted at "facet" so I assume you're using ggplot.
library(ggplot2)
ggplot(monthly.avg.gross, aes(x=month, y=avgGross)) +
geom_point() +
facet_wrap(~ genre)
You can do stuff like add month labels and treat month as a factor instead of a number like here, but that's peripheral to your question.
Thank you very much @mathematical.coffee. I was able to adapt your answer to produce the appropriate bar charts.
meanGrossGenreMonth = ddply(grossGenreMonthLong,
.(ReleasedMonth, variable),
summarise,
mean.Gross = mean(Gross, na.rm = TRUE))
# plot bar plots with facets
ggplot(meanGrossGenreMonth, aes(x = factor(ReleasedMonth), y = mean.Gross)) +
  geom_bar(stat = "identity") + facet_wrap(~ variable) + ylab("mean Gross ($)") +
  xlab("Month") + ggtitle("Mean gross revenue vs. month released by Genre")

R - .csv file - extract variables [closed]

I pulled in a large .csv file with columns such as "paid" and "description"
I am trying to figure out how to only pull the "paid" column when the "description" is Bronchitis or some other illness that is in the column.
This would be like doing a pivot table in Excel and filtering only on a certain Description and receiving all of the individual paid rows.
Paid Description val
$500 Bronchitis 1.5
$3,250 'Complication of Pregnancy/Childbirth' 2.2
$5,400 Burns 3.3
$20.50 Bronchitis 4.4
$24 Ashtma 1.2
If your data is
paid <- c(300,200,150)
desc <- c("bronchitis","headache","broken.leg")
df <- data.frame(paid, desc)
Try
df[df$desc == "bronchitis", "paid"]
# the argument ahead of the comma filters the rows,
# the argument after the comma selects the column
# > df[df$desc == "bronchitis", "paid"]
# [1] 300
or
library(dplyr)
df %>% filter(desc=="bronchitis") %>% select(paid)
# filter refers to the row condition
# select filters the output column(s)
# > df %>% filter(desc=="bronchitis") %>% select(paid)
# paid
# 1 300
Using data.table
library(data.table)#v1.9.5+
setkey(setDT(df1), Description)[.('Bronchitis'),'Paid', with=FALSE]
# Paid
#1: $500
#2: $20.50
data
df1 <- structure(list(ex = c("Description", "Bronchitis",
"Complication of Pregnancy/Childbirth",
"Burns", "Bronchitis", "Ashtma"), data = c("val", "1.5", "2.2",
"3.3", "4.4", "1.2")), .Names = c("ex", "data"), class = "data.frame",
row.names = c("Paid", "$500", "$3,250", "$5,400", "$20.50", "$24"))
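Note that in the question's sample the Paid values are currency strings such as "$3,250", so they would need converting before any arithmetic. A minimal sketch, assuming the columns are named Paid and Description as in the question (the data frame name claims and the column PaidNum are made up here):

```r
claims <- data.frame(
  Paid = c("$500", "$3,250", "$5,400", "$20.50", "$24"),
  Description = c("Bronchitis", "Complication of Pregnancy/Childbirth",
                  "Burns", "Bronchitis", "Ashtma"),
  stringsAsFactors = FALSE
)

# Strip the dollar sign and thousands separators, then convert to numeric
claims$PaidNum <- as.numeric(gsub("[$,]", "", claims$Paid))

# All individual paid amounts for one description
bronchitis_paid <- claims$PaidNum[claims$Description == "Bronchitis"]
bronchitis_paid
sum(bronchitis_paid)
```

The comparison is case-sensitive and exact, so misspelled or differently capitalized descriptions in the real file would not match.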
