I'm having more problems with the tidyr package in R. My experiment involves splitting a data frame into plot, plant, and leaf variables, and since the data frame is large, I need to do this in code. I'm using RStudio with the tidyr package.
I need to organize a data frame from this:
library(readr)
library(tidyr)
library(dplyr)
plot <- c("101","101","101","101","101","102","102","102","102","102")
plant <- c("1","2","3","4","5","1","2","3","4","5")
leaf_1 <- c("100","100","100","100","100","100","100","100","100","100")
leaf_2 <- c("90","90","90","90","90","90","90","90","90","90")
leaf_3 <- c("80","80","80","80","80","80","80","80","80","80")
plot <- as.data.frame(plot)
plant <- as.data.frame(plant)
leaf_1 <- as.data.frame(leaf_1)
leaf_2 <- as.data.frame(leaf_2)
leaf_3 <- as.data.frame(leaf_3)
data <- cbind(plot, plant, leaf_1, leaf_2, leaf_3)
View(data)
Into this:
plot <- c("101","101","101", "101","101","101","101","101","101","101","101","101","101","101","101")
plant <- c("1","1","1","2","2","2","3","3","3","4","4","4","5","5","5")
leaf_number <- c("1","2","3","1","2","3","1","2","3","1","2","3","1","2","3")
score <- c("100","90","80","100","90","80","100","90","80","100","90","80","100","90","80")
plot <- as.data.frame(plot)
plant <- as.data.frame(plant)
leaf_number <- as.data.frame(leaf_number)
score <- as.data.frame(score)
example <- cbind(plot, plant, leaf_number, score)
View(example)
Here is what I have already tried:
data1 <- gather(data, leaf_number, score, -plot)
But it just doesn't gather the data frame into what I need. Any help is greatly appreciated, thanks so much everybody!
data <- data.frame(
plot = c(101,101,101,101,101,102,102,102,102,102),
plant = c(1,2,3,4,5,1,2,3,4,5),
leaf_1 = c(100,100,100,100,100,100,100,100,100,100),
leaf_2 = c(90,90,90,90,90,90,90,90,90,90),
leaf_3 = c(80,80,80,80,80,80,80,80,80,80)
)
gather(data, leaf_number, score, -c(plot, plant))
# plot plant leaf_number score
#1 101 1 leaf_1 100
#2 101 2 leaf_1 100
#3 101 3 leaf_1 100
#4 101 4 leaf_1 100
#5 101 5 leaf_1 100
#6 102 1 leaf_1 100
#7 102 2 leaf_1 100
#etc.
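For what it's worth, newer versions of tidyr (1.0.0 and later) also provide pivot_longer(), which can strip the leaf_ prefix so that leaf_number holds just 1, 2, 3 as in the desired output. A minimal sketch, assuming the same numeric data frame as above (rebuilt here so the snippet runs on its own):

```r
library(tidyr)

# Same values as the data frame above, written compactly
data <- data.frame(
  plot   = rep(c(101, 102), each = 5),
  plant  = rep(1:5, times = 2),
  leaf_1 = 100,
  leaf_2 = 90,
  leaf_3 = 80
)

long <- pivot_longer(
  data,
  cols         = starts_with("leaf_"),  # every leaf_* column
  names_to     = "leaf_number",         # former column names go here
  names_prefix = "leaf_",               # drop the prefix, keep "1", "2", "3"
  values_to    = "score"
)
head(long, 3)
```

Unlike gather(), the rows come out grouped by plot and plant rather than by leaf column, which is closer to the layout asked for in the question.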
I have a dataset, df1, where columns consist of various chemicals and rows consist of samples identified by their id, with the concentration of each chemical.
I need to correct the chemical concentrations using a unique value for each chemical; these values are found in another dataset, df2.
Here's a minimal df1 dataset:
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",")
and here is a df2 example:
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",")
What I need to do is divide all observations of chem1 in df1 by the value provided for chem1 in df2, and repeat this for each chemical. In reality, chemical names are not sequential, and there are roughly 30 chemicals.
Previously I would have done this using Excel and index/match but I'm looking to make my methods more reproducible, hence fighting my way through with R. I mostly do data manipulation with dplyr, so if there's a tidyverse solution out there, that would be great!
Thankful for any help
We can use the 'chem' column from 'df2' to subset 'df1', divide by the 'value' column of 'df2' (replicated so that the lengths match), and assign the results back to those columns of 'df1':
df1[as.character(df2$chem)] <- df1[as.character(df2$chem)]/df2$value[col(df1[-1])]
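A possible variant, in case df1 also holds columns that are not chemicals or df2 lists chemicals that df1 lacks: match by name instead of by position. This is only a sketch; divisors and chem_cols are names introduced here for illustration, and the data is retyped from the question so the snippet runs on its own.

```r
df1 <- data.frame(
  id    = 1:5,
  chem1 = c(0.5, 1.5, 1, 2, 3),
  chem2 = c(1, 0.5, 1, 5, 4),
  chem3 = c(5, 2, 2.5, 3, 2.3),
  chemA = c(4, 3, 7, 1, 0.7),
  chemB = c(3, 4, 1, 7, 2.3)
)
df2 <- data.frame(
  chem  = c("chem1", "chem2", "chem3", "chemA", "chemB"),
  value = c(1.7, 2.3, 4.1, 5.2, 2.7)
)

# Named lookup vector: chemical name -> correction value
divisors <- setNames(df2$value, as.character(df2$chem))

# Only touch columns that appear in both data frames
chem_cols <- intersect(names(df1), names(divisors))

# Divide each column by the value that carries the same name
df1[chem_cols] <- Map(`/`, df1[chem_cols], divisors[chem_cols])
df1
```

The id column is left untouched because it never appears in the lookup vector.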
Using the reshape2 package, the data frame can be changed to long format and merged with df2 as follows. (Note that the example df2 introduces some whitespace in the chemical names, which is stripped in this solution.)
library(reshape2)
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",",stringsAsFactors = F)
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",",stringsAsFactors = F)
df2$chem <- gsub("\\s+","",df2$chem) #example introduces whitespaces in the names
df1A <- melt(df1,id.vars=c("id"),variable.name="chem")
combined <- merge(x=df1A,y=df2,by="chem",all.x=T)
combined$div <- combined$value.x/combined$value.y
head(combined)
chem id value.x value.y div
1 chem1 1 0.5 1.7 0.2941176
2 chem1 2 1.5 1.7 0.8823529
3 chem1 3 1.0 1.7 0.5882353
4 chem1 4 2.0 1.7 1.1764706
5 chem1 5 3.0 1.7 1.7647059
6 chem2 1 1.0 2.3 0.4347826
or in wide format:
> dcast(combined[,c("id","chem","div")],id ~ chem,value.var="div")
id chem1 chem2 chem3 chemA chemB
1 1 0.2941176 0.4347826 1.2195122 0.7692308 1.1111111
2 2 0.8823529 0.2173913 0.4878049 0.5769231 1.4814815
3 3 0.5882353 0.4347826 0.6097561 1.3461538 0.3703704
4 4 1.1764706 2.1739130 0.7317073 0.1923077 2.5925926
5 5 1.7647059 1.7391304 0.5609756 0.1346154 0.8518519
Here's a tidyverse solution.
library(dplyr)
library(tidyr)
df3 <- df1 %>%
  # convert the data from wide to long to make the next step easier
  gather(key = chem, value = value, -id) %>%
  # do your math, using 'match' to map values from df2 to the long rows
  # (refer to the 'chem' column directly; df3 does not exist yet at this point)
  mutate(value = value / df2$value[match(chem, df2$chem)]) %>%
  # return the data to wide format if that's how you prefer to store it
  spread(chem, value)
Sorry, I've tried my best but I couldn't find the answer. As a beginner, I'm not sure I can state the question clearly. Thanks in advance.
So I have a dataframe with consumption data with 24000 rows.
In this dataframe, there is a series of variables giving the number of objects bought within the last two months:
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession", coded as a number.
So now the data looks like this
profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1 1 1 1 1
individu2 3 2 4 1
individu3 2 2 0 0
individu4 6 0 3 2
individu5 5 0 2 3
individu6 7 1 0 5
individu7 4 3 1 2
I would like to know the structure of consumption by profession and get something like this :
ProportionOfCoat ProportionOfShirt ProportionOfShoe...
profession1 0.3 0.5 0.1
profession2 0.1 0.2 0.4
profession3 0.2 0.6 0.1
profession4 0.1 0.1 0.2
I don't know if this is clear, but in the end I want to be able to say:
10% of the clothing products that doctors bought are T-shirts, whereas 20% of what teachers bought are T-shirts.
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose we can use dplyr?
Thank you very much!
temp <- aggregate( . ~ profession, data=zzz, FUN=sum)
cbind(temp[1],temp[-1]/rowSums(temp[-1]))
or also using prop.table
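The prop.table route might look like the sketch below. It assumes the data frame is called zzz as in the lines above and uses the seven rows shown in the question; props and result are names made up for this example.

```r
# Rows copied from the question's example table
zzz <- data.frame(
  profession    = c(1, 3, 2, 6, 5, 7, 4),
  NumberOfCoat  = c(1, 2, 2, 0, 0, 1, 3),
  NumberOfShirt = c(1, 4, 0, 3, 2, 0, 1),
  NumberOfShoes = c(1, 1, 0, 2, 3, 5, 2)
)

# Total items per profession, one row per profession
temp <- aggregate(. ~ profession, data = zzz, FUN = sum)

# margin = 1 scales each row so that it sums to 1
props <- prop.table(as.matrix(temp[-1]), margin = 1)

result <- cbind(temp[1], props)
result
```

prop.table needs a matrix or table, hence the as.matrix() call on the count columns.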
As other people noted, it is always better to post a reproducible example. I'll try to post one with my solution, which is longer than the ones already posted but, for the same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # I set a seed because I'll use the sample() function
n <- 1:100 # vector from 1 to 100 to obtain the number of products bought
p <- 1:8 # vector for obtaining id of professions
profession <- sample(p,50, replace = TRUE)
NumberOfCoat <- sample(n,50, replace = TRUE)
NumberOfShirt <- sample(n,50, replace = TRUE)
NumberOfShoes <- sample(n,50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
NumberOfShirt, NumberOfShoes))
Once you have the dataframe, you can explain what you have tried so far, or a possible solution. Here I used dplyr.
library(dplyr)
df <- df %>% group_by(profession) %>% summarize(coats = sum(NumberOfCoat),
shirts = sum(NumberOfShirt),
shoes = sum(NumberOfShoes)) %>%
mutate(tot_prod = coats + shirts + shoes,
ProportionOfCoat = coats/tot_prod,
ProportionOfShirt = shirts/tot_prod,
ProportionofShoes = shoes/tot_prod) %>%
select(profession, ProportionOfCoat, ProportionOfShirt,
ProportionofShoes)
df corresponds to the second dataframe you show, where you have the proportion of each product bought by each profession. In my example it looks like this:
profession ProportionOfCoat ProportionOfShirt ProportionofShoes
<int> <dbl> <dbl> <dbl>
1 1 0.3910483 0.2343934 0.3745583
2 2 0.4069641 0.3525571 0.2404788
3 3 0.3330804 0.3968134 0.2701062
4 4 0.2740657 0.3952435 0.3306908
5 5 0.2573991 0.3784753 0.3641256
6 6 0.2293814 0.3543814 0.4162371
7 7 0.2245841 0.3955638 0.3798521
8 8 0.2861635 0.3490566 0.3647799
If you want to produce a stacked barplot, you have to reshape your data to a long format in order to be able to use ggplot2. As @alistaire noted, you can do it with the gather function from the tidyr package.
library(tidyr)
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
library(ggplot2)
ggplot(df, aes(x=profession, y=proportion, fill=product)) +
geom_bar(stat="identity")
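As a side note, ggplot2 can also do the scaling to 100% itself: with position = "fill", each bar is normalised to a total height of 1, so raw counts can be plotted directly. A rough sketch with made-up counts (the counts data frame and its numbers are hypothetical, not taken from the question):

```r
library(ggplot2)

# Hypothetical summed counts per profession and product
counts <- data.frame(
  profession = rep(c("1", "2"), each = 3),
  product    = rep(c("coat", "shirt", "shoes"), times = 2),
  n          = c(30, 50, 20, 10, 20, 70)
)

p <- ggplot(counts, aes(x = profession, y = n, fill = product)) +
  geom_col(position = "fill") +                  # stack and rescale each bar to 1
  scale_y_continuous(labels = scales::percent)   # show the axis as 0% to 100%
p
```

This skips the manual proportion step, at the cost of no longer having the proportions available as a table.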
I have a dataframe that includes four bacteria types: R, B, P, Bi - this is in variable.x
value.y is their abundance and variable.y is various groups they are in.
I would like to plot them according to their food categories: "FiberCategory", "FruitCategory", "VegetablesCategory" & "WholegrainCategory." I have made 4 separate files that look like this:
Sample Bacteria Abundance Category Level
30841102 R 0.005293192 1 Low
30841102 P 0.000002570 1 Low
30841102 B 0.005813275 1 Low
30841102 Bi 0.000000000 1 Low
49812105 R 0.003298709 1 Low
49812105 P 0.000000855 1 Low
49812105 B 0.131147541 1 Low
49812105 Bi 0.000350086 1 Low
So, I would like a bar plot of how much of each bacteria is in each category. So it should be 4 plots, for each bacteria, with value on the y-axis and food category on the x-axis.
I have tried this code:
library(dplyr)
genus_veg %>% group_by(Genus, Abundance) %>% summarise(Abundance = sum(Abundance)) %>%
ggplot(aes(x = Level, y= Abundance, fill = Genus)) + geom_bar(stat="identity")
But get this error:
Error: cannot modify grouping variable
Any suggestions?
TL;DR Combine individual plots with cowplot
In another interpretation of the super unclear question, this time from:
Plotting Bacteria according to Food Groups & Abundance in R
and
would like to plot them according to their food categories: "FiberCategory", "FruitCategory", "VegetablesCategory" & "WholegrainCategory." I have made 4 separate files
You might be asking for:
You want a bar chart
You want 4 plots, one for each of the food categories
x-axis = bacteria type
y-axis = abundance of bacteria
Input
Let's say you have a data frame for each food category. (Again, I'm using dummy data)
library(tidyr)
library(dplyr)
library(ggplot2)
## The categories you have defined
bacteria <- c("R", "B", "P", "Bi")
food <- c("FiberCategory", "FruitCategory", "VegetablesCategory", "WholegrainCategory")
## Create dummy data for plotting
set.seed(1)
num_rows <- length(bacteria)
num_cols <- length(food)
dummydata <-
matrix(data = abs(rnorm(num_rows*num_cols, mean=0.01, sd=0.05)),
nrow=num_rows, ncol=num_cols)
rownames(dummydata) <- bacteria
colnames(dummydata) <- food
dummydata <-
dummydata %>%
as.data.frame() %>%
tibble::rownames_to_column("bacteria") %>%
gather(food, abundance, -bacteria)
## If we have 4 data frames
filter_food <- function(dummydata, foodcat){
dummydata %>%
filter(food == foodcat) %>%
select(-food)
}
dd_fiber <- filter_food(dummydata, "FiberCategory")
dd_fruit <- filter_food(dummydata, "FruitCategory")
dd_veg <- filter_food(dummydata, "VegetablesCategory")
dd_grain <- filter_food(dummydata, "WholegrainCategory")
Where one data frame looks something like
#> dd_grain
# bacteria abundance
#1 R 0.02106203
#2 B 0.10073499
#3 P 0.06624655
#4 Bi 0.00775332
Plot
You can create separate plots. (Here, I'm using a function to generate my plots)
plot_food <- function(dd, title=""){
dd %>%
ggplot(aes(x = bacteria, y = abundance)) +
geom_bar(stat = "identity") +
ggtitle(title)
}
plt_fiber <- plot_food(dd_fiber, "fiber")
plt_fruit <- plot_food(dd_fruit, "fruit")
plt_veg <- plot_food(dd_veg, "veg")
plt_grain <- plot_food(dd_grain, "grain")
And then combine them using cowplot
cowplot::plot_grid(plt_fiber, plt_fruit, plt_veg, plt_grain)
TL;DR Plotting by facets
How you posed the question is super unclear. So I have interpreted your question from
So, I would like a bar plot of how much of each bacteria is in each category. So it should be 4 plots, for each bacteria, with value on the y-axis and food category on the x-axis.
as:
You want a bar chart
You want 4 plots, one for each of the bacteria types: R, B, P, Bi
x-axis = food category
y-axis = abundance of bacteria
Input
Regarding the input data, it was unclear, e.g. you did not describe what "Sample", "Level", or "Category" is. Ideally, you would keep all the food categories in one data frame, e.g.
library(tidyr)
library(dplyr)
library(ggplot2)
## The categories you have defined
bacteria <- c("R", "B", "P", "Bi")
food <- c("FiberCategory", "FruitCategory", "VegetablesCategory", "WholegrainCategory")
## Create dummy data for plotting
set.seed(1)
num_rows <- length(bacteria)
num_cols <- length(food)
dummydata <-
matrix(data = abs(rnorm(num_rows*num_cols, mean=0.01, sd=0.05)),
nrow=num_rows, ncol=num_cols)
rownames(dummydata) <- bacteria
colnames(dummydata) <- food
dummydata <-
dummydata %>%
as.data.frame() %>%
tibble::rownames_to_column("bacteria") %>%
gather(food, abundance, -bacteria)
of which the output looks like:
#> dummydata
# bacteria food abundance
#1 R FiberCategory 0.021322691
#2 B FiberCategory 0.019182166
#3 P FiberCategory 0.031781431
#4 Bi FiberCategory 0.089764040
#5 R FruitCategory 0.026475389
#6 B FruitCategory 0.031023419
#7 P FruitCategory 0.034371453
#8 Bi FruitCategory 0.046916235
#9 R VegetablesCategory 0.038789068
#10 B VegetablesCategory 0.005269419
#11 P VegetablesCategory 0.085589058
#12 Bi VegetablesCategory 0.029492162
#13 R WholegrainCategory 0.021062029
#14 B WholegrainCategory 0.100734994
#15 P WholegrainCategory 0.066246546
#16 Bi WholegrainCategory 0.007753320
Plot
Once you have the data formatted as above, you can simply do:
dummydata %>%
ggplot(aes(x = food,
y = abundance,
group = bacteria)) +
geom_bar(stat="identity") +
## Split into 4 plots
## Note: can also use 'facet_grid' to do this
facet_wrap(~bacteria) +
theme(
## rotate the x-axis label
axis.text.x = element_text(angle=90, hjust=1, vjust=.5)
)
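If the data already lives in four separate data frames (one per food category), one way to get to the single long data frame used here is dplyr::bind_rows() with its .id argument. The sketch below uses two small hypothetical data frames with invented values; the real ones would come from the four files.

```r
library(dplyr)

# Hypothetical stand-ins for two of the four per-category data frames
dd_fiber <- data.frame(bacteria  = c("R", "B", "P", "Bi"),
                       abundance = c(0.02, 0.02, 0.03, 0.09))
dd_fruit <- data.frame(bacteria  = c("R", "B", "P", "Bi"),
                       abundance = c(0.03, 0.03, 0.03, 0.05))

# The list names become the values of the new 'food' column
combined <- bind_rows(
  list(FiberCategory = dd_fiber, FruitCategory = dd_fruit),
  .id = "food"
)
combined
```

The result has the columns food, bacteria, and abundance, which is the shape the facet_wrap() call above expects.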
I have a data frame, one of the columns representing years. Let's say
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c("10","11","12","13","14","15","16","17","18")
m2 <- c("20","30","40","50","60","70","80","90","100")
data <- data.frame(region,year,m1,m2)
I want to aggregate m1 by taking 3-year averages for each country. I am confused about how to do that with a data frame. Any comment is highly appreciated.
Thanks in advance!
First, your m1 variable needs to be numeric. Convert it using as.numeric():
data$m1 <- as.numeric(as.character(data$m1))
Then, you can use aggregate like this:
aggregate(m1 ~ region, FUN = mean, data = data)
# region m1
# 1 Italy 14
# 2 Norway 15
# 3 Spain 13
To avoid the awkward type conversion (as.numeric(as.character())), you should eliminate the quotes from the setup for m1 and m2:
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
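With numeric columns in place, aggregate can also average both measures in one call by putting cbind() on the left-hand side of the formula. A small sketch using the question's data (res is a name introduced here):

```r
region <- c("Spain", "Italy", "Norway")
year   <- rep(c("2010", "2011", "2012"), times = 3)
m1     <- c(10, 11, 12, 13, 14, 15, 16, 17, 18)
m2     <- c(20, 30, 40, 50, 60, 70, 80, 90, 100)
data   <- data.frame(region, year, m1, m2)

# cbind() on the left-hand side aggregates several columns at once
res <- aggregate(cbind(m1, m2) ~ region, data = data, FUN = mean)
res
```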
Alternative approach using dplyr:
library(dplyr)
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
data <- data.frame(region,year,m1,m2)
data %>%
group_by(region) %>%
summarise(mean_m1 = mean(m1),
mean_m2 = mean(m2))
# region mean_m1 mean_m2
# 1 Italy 14 60
# 2 Norway 15 70
# 3 Spain 13 50
I am working with a dataset that comes with lme4, and am trying to learn how to apply reshape2 to convert it from long to wide [full code at the end of the post].
library(lme4)
data("VerbAgg") # load the dataset
The dataset has 9 variables; 'Anger', 'Gender', and 'id' don't vary with 'item', while 'resp',
'btype', 'situ', 'mode', and 'r2' do.
I have successfully been able to convert the dataset from long to wide format using reshape():
wide <- reshape(VerbAgg, timevar=c("item"),
idvar=c("id", 'Gender', 'Anger'), dir="wide")
Which yields 316 observations on 123 variables, and appears to be correctly transformed. However, I have had no success using reshape/reshape2 to reproduce the wide dataframe.
wide2 <- recast(VerbAgg, id + Gender + Anger ~ item + variable)
Using Gender, item, resp, id, btype, situ, mode, r2 as id variables
Error: Casting formula contains variables not found in molten data: Anger
I may not be 100% clear on how recast defines id variables, but I am very confused why it does not see "Anger". Similarly,
wide3 <- recast(VerbAgg, id + Gender + Anger ~ item + variable,
id.var = c("id", "Gender", "Anger"))
Error: Casting formula contains variables not found in molten data: item
Can anyone see what I am doing wrong? I would love to obtain a better understanding of melt/cast!
Full code:
## load the lme4 package
library(lme4)
data("VerbAgg")
head(VerbAgg)
names(VerbAgg)
# Using base reshape()
wide <- reshape(VerbAgg, timevar=c("item"),
idvar=c("id", 'Gender', 'Anger'), dir="wide")
# Using recast
library(reshape2)
wide2 <- recast(VerbAgg, id + Gender + Anger ~ item + variable)
wide3 <- recast(VerbAgg, id + Gender + Anger ~ item + variable,
id.var = c("id", "Gender", "Anger"))
# Using melt/cast
m <- melt(VerbAgg, id=c("id", "Gender", "Anger"))
wide <- dcast(m, id + Gender + Anger ~ ...)
Aggregation requires fun.aggregate: length used as default
# Yields a list object with a length of 8?
m <- melt(VerbAgg, id=c("id", "Gender", "Anger"), measure.vars = c(4,6,7,8,9))
wide <- dcast(m, id ~ variable)
# Yields a data frame object with 6 variables.
I think the following code does what you want.
library(lme4)
data("VerbAgg")
# Using base reshape()
wide <- reshape(VerbAgg, timevar=c("item"),
idvar=c("id", 'Gender', 'Anger'), dir="wide")
dim(wide) # 316 123
# Using melt/cast
require(reshape2)
m1 <- melt(VerbAgg, id=c("id", "Gender", "Anger","item"), measure=c('resp','btype','situ','mode','r2'))
wide4 <- dcast(m1,id+Gender+Anger~item+variable)
dim(wide4) # 316 123
R> wide[1:5,1:6]
Anger Gender id resp.S1WantCurse btype.S1WantCurse situ.S1WantCurse
1 20 M 1 no curse other
2 11 M 2 no curse other
3 17 F 3 perhaps curse other
4 21 F 4 perhaps curse other
5 17 F 5 perhaps curse other
R> wide4[1:5,1:6]
id Gender Anger S1WantCurse_resp S1WantCurse_btype S1WantCurse_situ
1 1 M 20 no curse other
2 2 M 11 no curse other
3 3 F 17 perhaps curse other
4 4 F 21 perhaps curse other
5 5 F 17 perhaps curse other
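As a possible explanation for the two recast errors in the question: when no id variables are given, melt treats every factor or character column as an id variable and melts only the rest, so the integer column Anger ends up inside the variable/value pair instead of staying a column of its own. When id.var is given without item, item is melted as a measured variable and is no longer available to the formula. A small sketch on a hypothetical toy data frame (not VerbAgg) that shows both effects:

```r
library(reshape2)

# Toy stand-in: one integer column (Anger), the rest character
toy <- data.frame(
  id    = c("1", "1", "2", "2"),
  item  = c("a", "b", "a", "b"),
  Anger = c(20L, 20L, 11L, 11L),
  resp  = c("no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Default melt: character columns become id variables, Anger is melted away
default_names <- names(melt(toy))
default_names

# Keeping item among the id variables makes it available to the formula
m <- melt(toy, id.vars = c("id", "Anger", "item"), measure.vars = "resp")
wide <- dcast(m, id + Anger ~ item + variable)
wide
```

Here default_names contains no Anger column, which mirrors the first error, while the explicit id.vars call keeps id, Anger, and item so that the casting formula can find them.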