y-axis scale as percent in ggplot

I have already read through the other similar questions and answers, but none of the suggested solutions with scale_y_continuous work for my dataset. I have two treatment groups, data$InvA (Post-Covid) and data$InvAcc (Pre-Covid); in each group, subjects could choose between the options Online Broker (1), Bank (2) and No Account (3). Because subjects were randomly assigned to group 1 or 2, my dataset contains a lot of NAs. With ggplot I can display both results with the total number of individuals on the y-axis. However, I would like to change this to percent, since it would be a better fit for my thesis. Every option I tried with scale_y_continuous either doesn't work properly (axis labels such as 3000%, or wrong percent values) or doesn't work at all.
This is my code:
library(gridExtra)
library(ggplot2)
require(gridExtra)
library(tidyverse)
plot1 <- ggplot(data = data, aes(InvA), na.rm=TRUE) +
geom_bar()+
scale_x_discrete(na.translate = FALSE)+
ylim(0,40)+
ggtitle("Post-Covid")+
xlab("Accounts")+
ylab("Total No. of Individuals")
plot2 <- ggplot(data = data, aes(InvAcc), na.rm=TRUE) +
geom_bar()+
scale_x_discrete(na.translate = FALSE)+
ylim(0,40)+
ggtitle("Pre-Covid")+
xlab("Accounts")+
ylab("Total No. of Individuals")
grid.arrange(plot2, plot1,ncol=2) # Write the grid.arrange in the file
#dev.off() # Close the file
#pdf("Accountss.pdf", width = 8, height = 6) # Open a new pdf file
My data:
dput(data)
structure(list(data.InvAcc = c(2L, NA, 2L, NA, NA, 3L, 3L, 3L,
NA, 3L, 3L, NA, 1L, NA, 1L, NA, NA, 1L, NA, NA, NA, 1L, 3L, 1L,
NA, NA, 1L, 2L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 2L, NA,
NA, 3L, NA, NA, 1L, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
1L, 1L, 1L, NA, NA, NA, 3L, NA, 1L, NA, NA, 2L, NA, 1L, 1L, 1L,
NA, 1L, 3L, NA, 1L, NA, 3L, NA, NA, 2L, 3L, 2L, 1L, NA, 3L, 2L,
NA, NA, 3L, NA, 2L, 1L, NA, 3L, 2L, 1L, 3L, 3L, 3L, NA, 3L, NA,
3L, NA, 3L, 1L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, 3L, NA,
NA, 3L, 3L, 3L, 3L, NA, 1L, NA, NA, NA, 3L, NA, 3L), data.InvA = c(NA,
1L, NA, 2L, 1L, NA, NA, NA, 3L, NA, NA, 3L, NA, 3L, NA, 1L, 2L,
NA, 1L, 1L, 1L, NA, NA, NA, 1L, 2L, NA, NA, 2L, 1L, NA, NA, NA,
NA, NA, NA, NA, 3L, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, NA, NA, NA, NA, 1L, 1L, 1L, NA, 2L, NA,
2L, 1L, NA, 2L, NA, NA, NA, 2L, NA, NA, 2L, NA, 1L, NA, 3L, 3L,
NA, NA, NA, NA, 1L, NA, NA, 1L, 2L, NA, 1L, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 3L, NA, 1L,
2L, 2L, NA, 1L, 1L, NA, 3L, 1L, NA, NA, NA, NA, 1L, NA, 1L, 3L,
1L, NA, 3L, NA)), class = "data.frame", row.names = c(NA, -133L
))
data$InvAcc: Online Broker --> 31 (45%), Bank --> 11 (16%), No Account --> 27(39%)
data$InvA: Online Broker --> 40 (63%), Bank --> 13 (20%), No Account --> 11(17%)
Thank you all for your help, appreciate your time!

The issue is that you are plotting the counts. If you want to plot percentages, you have to tell ggplot to do so, e.g. with y = after_stat(prop), which maps the proportions instead of the counts to y. You can then get percent labels using scales::percent:
library(gridExtra)
library(ggplot2)
# ylim(0, 40) is left out: it was meant for counts and would be replaced by scale_y_continuous() anyway
plot1 <- ggplot(data = data, aes(InvA, y = after_stat(prop)), na.rm = TRUE) +
  geom_bar() +
  scale_x_discrete(na.translate = FALSE) +
  ggtitle("Post-Covid") +
  xlab("Accounts") +
  ylab("Share of Individuals") +
  scale_y_continuous(labels = scales::percent)
plot2 <- ggplot(data = data, aes(InvAcc, y = after_stat(prop)), na.rm = TRUE) +
  geom_bar() +
  scale_x_discrete(na.translate = FALSE) +
  ggtitle("Pre-Covid") +
  xlab("Accounts") +
  ylab("Share of Individuals") +
  scale_y_continuous(labels = scales::percent)
grid.arrange(plot2, plot1, ncol = 2)
#> Warning: Removed 64 rows containing non-finite values (stat_count).
#> Warning: Removed 69 rows containing non-finite values (stat_count).
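To sanity-check the bar heights, the same proportions can be computed in base R. The sketch below rebuilds a vector from the counts quoted in the question for InvA (40, 13 and 11 answers plus 69 NAs); the name inv_a is made up for illustration. table() drops NAs by default, so the shares are relative to the 64 subjects who answered, which is also what the percentages in the question assume.

```r
# Hypothetical vector rebuilt from the counts quoted in the question
inv_a <- c(rep(1:3, times = c(40, 13, 11)), rep(NA, 69))

counts <- table(inv_a)        # NAs are excluded by default
shares <- prop.table(counts)  # counts divided by the number of non-NA answers

round(100 * shares, 1)        # shares in percent, one value per option
```

If these values differ from the bar heights in the plot, the proportions are most likely being computed per bar rather than over all answers.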

Related

Convert multiple columns from numeric to factor

I thought this task would be simple, and I was surprised that it wasn't.
I have several selected columns with coded responses (Likert scales). I want to transform them into factor variables with labelled levels (some of the levels were never chosen). The questionnaire is in German, which is why you probably won't be able to understand the labels.
df[,c(3:21,23:25)] <- apply(df[,c(3:21,23:25)],2,
function (x) factor(x,
levels = c(0,1,2,3,4),
labels = c("gar nicht",
"gering",
"eher schwach",
"eher stark",
"sehr stark")))
df[,22] <- apply(df[,22],1,
function (x) factor(x,
levels = c(0,1,2,3),
labels = c("gar nicht",
"sofort",
"mittelfristig",
"langfristig")))
I need to treat the columns separately because of the different scales. Nevertheless, this does not transform my data correctly: the outcome is character, not factor.
Here is my test data:
structure(list(ï..lfdNr = 1:20, company = c("Nationalpark Thayathal",
"Naturpark Heidenreichsteiner Moor", "Naturpark Hohe Wand", "Tierpark Stadt Haag",
"Ötscher Tropfsteinhöhle", "Carnuntum", "Stift Heiligenkreuz",
"Ruine Kollmitz", "Schlosshof", "Retzer Erlebniskeller", "LOISIUM Weinwelt",
"Bio Imkerei Stögerer", "Amethyst Welt Maissau", "Donau Niederösterreich tourismus",
"Niederösterreich Bahnen", "Benediktinerstift Melk", "Kunstmeile Krems",
"Die Garten Tulln", "Winzer Krems ", "Domäne Wachau"), A2_1_hitz = c(4L,
NA, NA, 3L, NA, NA, 3L, 2L, 3L, NA, 3L, NA, 3L, NA, 2L, 3L, 3L,
4L, 2L, 3L), A2_2_trock = c(3L, NA, NA, 3L, NA, NA, 3L, NA, 3L,
NA, 2L, NA, 1L, NA, 2L, 4L, 3L, 4L, 2L, 3L), A2_3_reg = c(2L,
NA, NA, 2L, NA, NA, 3L, 2L, 3L, NA, 3L, NA, 2L, NA, 3L, 4L, 2L,
3L, 4L, 2L), A2_4_schnee = c(4L, NA, NA, 3L, NA, NA, NA, 3L,
3L, NA, 1L, NA, 0L, NA, 4L, NA, 3L, 4L, 4L, 1L), B1_1_hitz = c(4L,
NA, NA, 1L, NA, NA, NA, 3L, 3L, NA, 2L, NA, NA, NA, 2L, 3L, 2L,
4L, 0L, 2L), B1_2_trock = c(3L, NA, NA, 2L, NA, NA, NA, NA, 3L,
NA, 0L, NA, NA, NA, 2L, 3L, 2L, 4L, 3L, 1L), B1_3_reg = c(2L,
NA, NA, 1L, NA, NA, NA, NA, 3L, NA, 3L, NA, NA, NA, 3L, 3L, 1L,
2L, 3L, 3L), B1_4_schnee = c(1L, NA, NA, 0L, NA, NA, 0L, 0L,
1L, NA, NA, NA, NA, NA, 4L, 1L, 0L, 4L, 0L, 0L), B2_1_nZuk = c(3L,
NA, NA, 0L, NA, NA, NA, 0L, 0L, NA, 0L, NA, 0L, 3L, 3L, 0L, 3L,
2L, 0L, 0L), B2_2_mZuk = c(3L, NA, NA, 0L, NA, NA, NA, 0L, 2L,
NA, 2L, NA, 0L, 2L, 3L, 0L, 3L, 2L, 3L, 0L), B2_3_fZuk = c(3L,
NA, NA, 2L, NA, NA, NA, NA, 2L, NA, 2L, NA, 0L, 2L, 3L, 0L, 3L,
NA, 3L, 0L), C1_1_aktEin = c(2L, NA, NA, 1L, NA, NA, NA, NA,
2L, NA, NA, NA, NA, NA, NA, 0L, 1L, 3L, 2L, 3L), C1_2_zukEin = c(3L,
NA, NA, 2L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, 0L, 2L,
4L, 3L, 3L), C2_1_bisVer = c(2L, NA, NA, 1L, NA, NA, NA, NA,
2L, NA, NA, NA, NA, NA, 2L, 2L, 1L, 3L, 2L, 2L), C2_2_zukVer = c(3L,
NA, NA, 2L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, 2L, 2L, 2L,
3L, 3L, 2L), C3_1_bisVer = c(NA, NA, NA, 1L, NA, NA, 2L, NA,
3L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, 2L, 2L), C3_2_zukVer = c(NA,
NA, NA, 2L, NA, NA, 3L, NA, 3L, NA, NA, NA, NA, NA, 1L, 2L, 2L,
NA, 3L, 2L), C4_1_EinKlim = c(NA, NA, NA, 2L, NA, NA, NA, NA,
2L, NA, 2L, NA, NA, NA, 3L, 0L, 1L, NA, 3L, 1L), D1a_1_StÃ.rke = c(NA,
NA, NA, 3L, NA, NA, NA, NA, 3L, NA, NA, NA, 3L, NA, 2L, 3L, 2L,
3L, 3L, 3L), D1b_1_Dring = c(NA, NA, NA, NA, NA, NA, 2L, 3L,
NA, NA, NA, NA, 2L, NA, 1L, 1L, 1L, 1L, 1L, 1L), D5_1_bestBed = c(NA,
NA, NA, 0L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, 2L, 1L,
NA, 3L, 3L), E1_1_zuBesuch = c(NA, NA, NA, 2L, NA, NA, NA, NA,
3L, NA, NA, NA, NA, NA, 4L, 1L, 4L, NA, 4L, NA), E1_2_wirtBed = c(NA,
NA, NA, 3L, NA, NA, 3L, NA, 2L, NA, NA, NA, NA, NA, 1L, 1L, 4L,
NA, 3L, NA)), row.names = c(NA, 20L), class = "data.frame")
Thanks,
nadine

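We need lapply and not apply, because apply converts its input to a matrix, and a matrix can hold only a single class. Each set of columns also needs the levels and labels of its own scale:
df[, c(3:21, 23:25)] <- lapply(df[, c(3:21, 23:25)],
  function (x) factor(x,
    levels = c(0, 1, 2, 3, 4),
    labels = c("gar nicht",
      "gering",
      "eher schwach",
      "eher stark",
      "sehr stark")))
# Column 22 uses the four-level scale from the question
df[[22]] <- factor(df[[22]],
  levels = c(0, 1, 2, 3),
  labels = c("gar nicht", "sofort", "mittelfristig", "langfristig"))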

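The difference between the two functions can be seen on a tiny example. This is a minimal sketch with a made-up data frame df_demo and a helper to_likert (both names are assumptions, not part of the question): apply() returns a character matrix, while lapply() returns a list of factors that can be assigned back column by column.

```r
# Made-up data, for illustration only
df_demo <- data.frame(id = 1:4,
                      q1 = c(0L, 2L, NA, 4L),
                      q2 = c(1L, 1L, 3L, NA))
lik <- c("gar nicht", "gering", "eher schwach", "eher stark", "sehr stark")
to_likert <- function(x) factor(x, levels = 0:4, labels = lik)

# apply() works on a matrix and collapses the factors to character
via_apply <- apply(df_demo[, 2:3], 2, to_likert)
is.character(via_apply)

# lapply() keeps each column as a factor
df_demo[, 2:3] <- lapply(df_demo[, 2:3], to_likert)
sapply(df_demo[, 2:3], class)
```

Levels that never occur in the data are still kept, because they are listed explicitly in levels.
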
Label single bars in ggplot

I wonder how to add labels to the individual bars in ggplot2: I would like to label the bars in my bar plot as Online Broker, Bank and No Account. Thank you for your help!
Here is my code:
library(gridExtra)
library(ggplot2)
require(gridExtra)
library(tidyverse)
library (scales)
plot1 <- ggplot(data = df, aes(df$InvA, y = after_stat(prop)), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ggtitle("Post-Covid") +
xlab("Accounts") +
ylab("Individuals") +
scale_y_continuous(labels = percent_format(), limits=c(0,0.8))
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
plot2 <- ggplot(data = df, aes(df$InvAcc, y = after_stat(prop)), na.rm = TRUE) +
geom_bar() +
scale_x_discrete(na.translate = FALSE) +
ggtitle("Pre-Covid") +
xlab("Accounts") +
ylab("Individuals") +
scale_y_continuous(labels = percent_format(), limits=c(0,0.8))
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
grid.arrange(plot2, plot1, ncol = 2)
The plot then looks like this:
However, the axis labels should look like those in my previous plot. Since I managed to get the y-scale to percent, my labels have disappeared: computing the percentages did not work with the values kept as factors (Online Broker, Bank, No Account), so I had to change them to numeric (1, 2, 3):
dput(dfaccounts) # (with 1=Online Broker, 2=Bank, 3=No Account)
structure(list(df.InvAcc = c(2L, NA, 2L, NA, NA, 3L, 3L, 3L,
NA, 3L, 3L, NA, 1L, NA, 1L, NA, NA, 1L, NA, NA, NA, 1L, 3L, 1L,
NA, NA, 1L, 2L, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 2L, NA,
NA, 3L, NA, NA, 1L, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
1L, 1L, 1L, NA, NA, NA, 3L, NA, 1L, NA, NA, 2L, NA, 1L, 1L, 1L,
NA, 1L, 3L, NA, 1L, NA, 3L, NA, NA, 2L, 3L, 2L, 1L, NA, 3L, 2L,
NA, NA, 3L, NA, 2L, 1L, NA, 3L, 2L, 1L, 3L, 3L, 3L, NA, 3L, NA,
3L, NA, 3L, 1L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, 3L, NA,
NA, 3L, 3L, 3L, 3L, NA, 1L, NA, NA, NA, 3L, NA, 3L), df.InvA = c(NA,
1L, NA, 2L, 1L, NA, NA, NA, 3L, NA, NA, 3L, NA, 3L, NA, 1L, 2L,
NA, 1L, 1L, 1L, NA, NA, NA, 1L, 2L, NA, NA, 2L, 1L, NA, NA, NA,
NA, NA, NA, NA, 3L, NA, 1L, 1L, NA, 1L, 1L, NA, 1L, NA, 1L, 3L,
1L, 1L, 1L, 2L, 1L, 1L, NA, NA, NA, NA, 1L, 1L, 1L, NA, 2L, NA,
2L, 1L, NA, 2L, NA, NA, NA, 2L, NA, NA, 2L, NA, 1L, NA, 3L, 3L,
NA, NA, NA, NA, 1L, NA, NA, 1L, 2L, NA, 1L, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, 1L, 1L, 3L, NA, 1L,
2L, 2L, NA, 1L, 1L, NA, 3L, 1L, NA, NA, NA, NA, 1L, NA, 1L, 3L,
1L, NA, 3L, NA)), class = "data.frame", row.names = c(NA, -133L
))

The issue is that your InvA and InvAcc columns are numeric. Hence, when you use scale_x_discrete, the axis text gets dropped.
To fix your issue, I would suggest converting the columns to factors with your desired labels set via the labels argument. Additionally, to get the right percentages we have to explicitly set the group aesthetic via group = 1:
library(gridExtra)
library(ggplot2)
labels <- c("Online-Broker", "Bank", "No Account")
data$InvA <- factor(data$InvA, labels = labels)
data$InvAcc <- factor(data$InvAcc, labels = labels)
# ylim(0, 40) is left out: it was meant for counts and would be replaced by scale_y_continuous() anyway
plot1 <- ggplot(data = data, aes(InvA, y = after_stat(prop), group = 1), na.rm = TRUE) +
  geom_bar() +
  scale_x_discrete(na.translate = FALSE) +
  ggtitle("Post-Covid") +
  xlab("Accounts") +
  ylab("Share of Individuals") +
  scale_y_continuous(labels = scales::percent)
plot2 <- ggplot(data = data, aes(InvAcc, y = after_stat(prop), group = 1), na.rm = TRUE) +
  geom_bar() +
  scale_x_discrete(na.translate = FALSE) +
  ggtitle("Pre-Covid") +
  xlab("Accounts") +
  ylab("Share of Individuals") +
  scale_y_continuous(labels = scales::percent)
grid.arrange(plot2, plot1, ncol = 2)
#> Warning: Removed 64 rows containing non-finite values (stat_count).
#> Warning: Removed 69 rows containing non-finite values (stat_count).
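Why group = 1 matters with a factor on the x-axis can be checked with layer_data(), which returns the values a layer computes. The sketch below uses a made-up four-row data frame d (an assumption for illustration, not the data from the question): without group = 1 each factor level forms its own group, so every bar gets a proportion of 1; with group = 1 the proportions are taken over all rows.

```r
library(ggplot2)

# Made-up data, for illustration only
d <- data.frame(acc = factor(
  c("Online-Broker", "Online-Broker", "Bank", "No Account"),
  levels = c("Online-Broker", "Bank", "No Account")
))

p_default <- ggplot(d, aes(acc, y = after_stat(prop))) + geom_bar()
p_grouped <- ggplot(d, aes(acc, y = after_stat(prop), group = 1)) + geom_bar()

# One group per level: each bar is 100% of its own group
prop_default <- layer_data(p_default)$prop
# One group overall: each bar is its share of all rows
prop_grouped <- layer_data(p_grouped)$prop
```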

set factor level in ggplot2

Hello, I need help sorting the geom_segment layers in my plot by the column end_scaff.
Here is the code I used to produce the following plot:
library(ggplot2)
#Here I try to sort the data in order to get geom_segment sorted in the plot but it does not work
tab<-tab[with(tab, order(-end_scaff,-end_gene)), ]
ggplot(tab, aes(x = start_scaff, xend = end_scaff,
y = molecule, yend = molecule)) +
geom_segment(size = 3, col = "grey80") +
geom_segment(aes(x = ifelse(direction == 1, start_gene, end_gene),
xend = ifelse(direction == 1, end_gene, start_gene)),
data = tab,
arrow = arrow(length = unit(0.1, "inches")), size = 2) +
geom_text(aes(x = start_gene, y = molecule, label = gene),
data = tab, nudge_y = 0.2) +
scale_y_discrete(limits = rev(levels(tab$molecule))) +
theme_minimal()
Does someone have an idea how to sort the segments by the column end_scaff in descending order, so that scaffold_1254 is at the top of the plot and scaffold_74038 is at the bottom?
Here are the data:
> dput(tab)
structure(list(molecule = structure(c(2L, 6L, 6L, 3L, 7L, 4L,
5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "scaffold_1254", "scaffold_15158", "scaffold_7180",
"scaffold_74038", "scaffold_7638", "scaffold_8315"), class = "factor"),
gene = structure(c(8L, 6L, 5L, 3L, 7L, 4L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"G1", "G2", "G3", "G4", "G5", "G6", "G7"), class = "factor"),
start_gene = c(6708L, 9567L, 3456L, 10105L, 2760L, 9814L,
1476L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), end_gene = c(11967L, 10665L, 4479L, 10609L,
3849L, 10132L, 2010L, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), start_scaff = c(1L, 1L,
1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), end_scaff = c(20072, 15336,
15336, 13487, 10827, 10155, 2010, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), strand = structure(c(2L,
2L, 3L, 2L, 3L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "backward",
"forward"), class = "factor"), direction = c(-1L, -1L, 1L,
-1L, 1L, -1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA)), row.names = c(7L, 5L, 4L, 2L,
6L, 3L, 1L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L), class = "data.frame")

For a solution inside ggplot, you can remove the limits from scale_y_discrete (the axis is then ordered by the factor levels) and use y = reorder(molecule, end_scaff) inside aes:
library(dplyr)
library(ggplot2)
tab <- tab %>% filter(!is.na(start_gene))
ggplot(tab, aes(x = start_scaff, xend = end_scaff,
y = reorder(molecule, end_scaff), yend = molecule)) +
geom_segment(size = 3, col = "grey80") +
geom_segment(aes(x = ifelse(direction == 1, start_gene, end_gene),xend = ifelse(direction == 1, end_gene, start_gene)),
arrow = arrow(length = unit(0.1, "inches")), size = 2) +
geom_text(aes(x = start_gene, y = molecule, label = gene), nudge_y = 0.2) +
scale_y_discrete() +
theme_minimal()
Created on 2020-09-04 by the reprex package (v0.3.0)

The trick is to reorder the levels of molecule, not the entire data.frame. Instead of
tab <- tab[with(tab, order(-end_scaff,-end_gene)), ]
run
i <- with(tab, order(-end_scaff,-end_gene))
mol <- unique(tab$molecule[i])
tab$molecule <- factor(tab$molecule, levels = mol)
The same plotting code now produces the following graph.
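The effect of the three lines above can be seen without plotting anything. A minimal sketch on a made-up data frame tab_demo (the name and values are assumptions for illustration): after the reordering, the factor levels follow end_scaff in descending order, while the rows of the data frame stay where they were.

```r
# Made-up data, for illustration only
tab_demo <- data.frame(
  molecule  = factor(c("s_b", "s_a", "s_a", "s_c")),
  end_scaff = c(10, 30, 30, 20)
)

levels(tab_demo$molecule)  # alphabetical by default

# Order the rows by end_scaff (descending) and take the molecules in that order
i   <- order(-tab_demo$end_scaff)
mol <- unique(as.character(tab_demo$molecule[i]))
tab_demo$molecule <- factor(tab_demo$molecule, levels = mol)

levels(tab_demo$molecule)  # now follows end_scaff, largest first
```

ggplot2 places discrete values on the axis in level order, which is why changing the levels changes the plot while sorting the rows does not.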

Copy and insert row (with a minor change) in dataframe ABOVE a variable

I'm really new to R, and struggling with the following. If anyone could suggest where I look for a solution or point me in the right direction, I'd be forever grateful.
I have a dataset where I'd like to copy a row and insert that copy, with an amendment (in this case appending ", USA"), into the same dataframe whenever there is a value in the second column (a before and after dput are below).
I can find examples of duplicating row based on a regular pattern (ie. copy and insert every fourth row), but I'm not sure how I'd do that if the pattern isn't regular.
Any help would be greatly appreciated.
before = structure(list(Teams = structure(c(4L, 1L, 1L, 2L, 1L, 1L, 1L,
5L, 1L, 1L, 3L, 1L, 1L, 1L, 1L), .Label = c("", "Blue", "Green",
"Red", "Yellow"), class = "factor"), City = structure(c(1L, 2L,
1L, 1L, 4L, 1L, 1L, 1L, 5L, 1L, 1L, 3L, 1L, 1L, 1L), .Label = c("",
"California", "Chicago", "New York ", "Ohio"), class = "factor"),
Jan = c(NA, NA, 156.156, NA, NA, 818.87, 1586.4, NA, NA,
87.1, NA, NA, 873.4, 41.1, 1886.5), Feb = c(NA, NA, 1856,
NA, NA, 17.1, NA, NA, NA, NA, NA, NA, 48.8, NA, 187)), class = "data.frame", row.names = c(NA,
-15L))
after = structure(list(Teams = structure(c(4L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 5L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"Blue", "Green", "Red", "Yellow"), class = "factor"), City = structure(c(1L,
3L, 2L, 1L, 1L, 7L, 6L, 1L, 1L, 1L, 9L, 8L, 1L, 1L, 5L, 4L, 1L,
1L, 1L), .Label = c("", "California", "California, USA", "Chicago",
"Chicago, USA", "New York", "New York, USA", "Ohio", "Ohio, USA"
), class = "factor"), Jan = c(NA, NA, NA, 156.156, NA, NA, NA,
818.87, 1586.4, NA, NA, NA, 87.1, NA, NA, NA, 873.4, 41.1, 1886.5
), Feb = c(NA, NA, NA, 1856, NA, NA, NA, 17.1, NA, NA, NA, NA,
NA, NA, NA, NA, 48.8, NA, 187)), class = "data.frame", row.names = c(NA,
-19L))

Will this work for you?
library(dplyr)
before %>% mutate(City = ifelse(City != "", paste0(City, ", USA"), ""))
This works on the column as a whole and appends ", USA" to every non-empty city in place.
You can also use base R, which is more cumbersome. You need to convert City to character first.
before$City = as.character(before$City)
before[before[, 2] != "", 2] = paste0(before[before[, 2] != "", 2], ", USA")
Edits:
I don't have an elegant way to insert the amended copies as new rows. This is an ugly for loop solution.
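before$City = as.character(before$City)
df = NULL
for (i in 1:nrow(before)) {
  if (before[i, 2] != "") {
    # The amended copy goes ABOVE the original row, as in the expected output
    copy = before[i, ]
    copy[1, 2] = paste0(trimws(copy[1, 2]), ", USA")
    df = rbind(df, copy)
  }
  df = rbind(df, before[i, ])
}
df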
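The loop can also be avoided by repeating row indices. This is only a sketch of that alternative, shown on a made-up three-row data frame demo rather than on the data from the question: rows with a city are selected twice, and the first copy of each pair gets the ", USA" suffix.

```r
# Made-up data, for illustration only
demo <- data.frame(Teams = c("Red", "", ""),
                   City  = c("", "Ohio", ""),
                   Jan   = c(NA, NA, 87.1),
                   stringsAsFactors = FALSE)

has_city <- demo$City != ""
# Rows with a city appear twice, all other rows once
idx <- rep(seq_len(nrow(demo)), times = ifelse(has_city, 2L, 1L))
out <- demo[idx, ]

# The first copy of each duplicated row is the one to amend
amended <- has_city[idx] & !duplicated(idx)
out$City[amended] <- paste0(trimws(out$City[amended]), ", USA")
rownames(out) <- NULL
out
```

With a factor column, convert City to character first, as in the loop version.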

Transpose and Calculate Pearson correlation

I am really new to coding and I need to run a number of statistics in a dataset, for example the pearson correlation, but I am having some trouble manipulating the data.
From what I understood I need to transpose my data in order to calculate the pearson correlation, but here's where I'm having some problems. For starters, the column names turn into a new row instead of becoming the new column names. Then I get a message that my values are not numeric.
I also have some NA and I am trying to calculate the correlation with this code
cor(cr, use = "complete.obs", method = "pearson")
Error in cor(cr1, use = "complete.obs", method = "pearson") :
'x' must be numeric
I need to know the correlation between Victoria and Nuria which should yield 0.3651484
here is the dput of my dataset:
> dput(cr)
structure(list(User = structure(c(8L, 10L, 2L, 17L, 11L, 1L,
18L, 9L, 7L, 5L, 3L, 14L, 13L, 4L, 20L, 6L, 16L, 12L, 15L, 19L
), .Label = c("Ana", "Anton", "Bernard", "Carles", "Chris", "Ivan",
"Jim", "John", "Marc", "Maria", "Martina", "Nadia", "Nerea",
"Nuria", "Oriol", "Rachel", "Roger", "Sergi", "Valery", "Victoria"
), class = "factor"), Star.Wars.IV...A.New.Hope = c(1L, 5L, NA,
NA, 4L, 2L, NA, 4L, 5L, 4L, 2L, 3L, 2L, 3L, 4L, NA, NA, 4L, 5L,
1L), Star.Wars.VI...Return.of.the.Jedi = c(5L, 3L, NA, 3L, 3L,
4L, NA, NA, 1L, 2L, 1L, 5L, 3L, NA, 4L, NA, NA, 5L, 1L, 2L),
Forrest.Gump = c(2L, NA, NA, NA, 4L, 4L, 3L, NA, NA, NA,
5L, 2L, NA, 3L, NA, 1L, NA, 1L, NA, 2L), The.Shawshank.Redemption = c(NA,
2L, 5L, NA, 1L, 4L, 1L, NA, 4L, 5L, NA, NA, 5L, NA, NA, NA,
NA, 5L, NA, 4L), The.Silence.of.the.Lambs = c(4L, 4L, 2L,
NA, 4L, NA, 1L, 3L, 2L, 3L, NA, 2L, 4L, 2L, 5L, 3L, 4L, 1L,
NA, 5L), Gladiator = c(4L, 2L, NA, 1L, 1L, NA, 4L, 2L, 4L,
NA, 5L, NA, NA, NA, 5L, 2L, NA, 1L, 4L, NA), Toy.Story = c(2L,
1L, 4L, 2L, NA, 3L, NA, 2L, 4L, 4L, 5L, 2L, 4L, 3L, 2L, NA,
2L, 4L, 2L, 2L), Saving.Private.Ryan = c(2L, NA, NA, 3L,
4L, 1L, 5L, NA, 4L, 3L, NA, NA, 5L, NA, NA, 2L, NA, NA, 1L,
3L), Pulp.Fiction = c(NA, NA, NA, 4L, NA, 4L, 2L, 3L, NA,
4L, NA, 1L, NA, NA, 3L, NA, 2L, 5L, 3L, 2L), Stand.by.Me = c(3L,
4L, 1L, NA, 1L, 4L, NA, NA, 1L, NA, NA, NA, NA, 4L, 5L, 1L,
NA, NA, 3L, 2L), Shakespeare.in.Love = c(2L, 3L, NA, NA,
5L, 5L, 1L, NA, 2L, NA, NA, 3L, NA, NA, NA, 5L, 2L, NA, 3L,
1L), Total.Recall = c(NA, 2L, 1L, 4L, 1L, 2L, NA, 2L, 3L,
NA, 3L, NA, 2L, 1L, 1L, NA, NA, NA, 1L, NA), Independence.Day = c(5L,
2L, 4L, 1L, NA, 4L, NA, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 3L, NA,
NA, NA, NA, NA), Blade.Runner = c(2L, NA, 4L, 3L, 4L, NA,
3L, 2L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 4L, NA, 5L),
Groundhog.Day = c(NA, 2L, 1L, 5L, NA, 1L, NA, 4L, 5L, NA,
NA, 2L, 3L, 3L, 2L, 5L, NA, NA, NA, 5L), The.Matrix = c(4L,
NA, 1L, NA, 3L, NA, 1L, NA, NA, 2L, 1L, 5L, NA, 5L, NA, 2L,
4L, NA, 2L, 4L), Schindler.s.List = c(2L, 5L, 2L, 5L, 5L,
NA, NA, 1L, NA, 5L, NA, NA, NA, 1L, 3L, 2L, NA, 2L, NA, 3L
), The.Sixth.Sense = c(5L, 1L, 3L, 1L, 5L, 3L, NA, 3L, NA,
1L, 2L, NA, NA, NA, NA, 4L, NA, 1L, NA, 5L), Raiders.of.the.Lost.Ark = c(NA,
3L, 1L, 1L, NA, NA, 5L, 5L, NA, NA, 1L, NA, 5L, NA, 3L, 3L,
NA, 2L, NA, 3L), Babe = c(NA, NA, 3L, 2L, NA, 2L, 2L, NA,
5L, NA, 4L, 2L, NA, NA, 1L, 4L, NA, 5L, NA, NA)), .Names = c("User",
"Star.Wars.IV...A.New.Hope", "Star.Wars.VI...Return.of.the.Jedi",
"Forrest.Gump", "The.Shawshank.Redemption", "The.Silence.of.the.Lambs",
"Gladiator", "Toy.Story", "Saving.Private.Ryan", "Pulp.Fiction",
"Stand.by.Me", "Shakespeare.in.Love", "Total.Recall", "Independence.Day",
"Blade.Runner", "Groundhog.Day", "The.Matrix", "Schindler.s.List",
"The.Sixth.Sense", "Raiders.of.the.Lost.Ark", "Babe"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Can someone help me?

This code should give you the correlation matrix between all users.
cr2 <- t(cr[, 2:21]) # Transpose (the first column contains the names)
colnames(cr2) <- as.character(cr$User) # Assign names to columns (cr is a tibble, so cr[, 1] may return a one-column tibble instead of a vector)
cor(cr2, use = "complete.obs") # Gives an error because there are no complete obs
# Error in cor(cr2, use = "complete.obs") : no complete element pairs
cor(cr2, use = "pairwise.complete.obs") # Use pairwise deletion
The correlation between Victoria and Nuria is 0.36514837 (using pairwise deletion).
Edit: To get just the correlation between Victoria and Nuria with listwise deletion, run the above and then
cr2 <- as.data.frame(cr2)
with(cr2, cor(Victoria, Nuria, use = "complete.obs", method = "pearson"))
[1] 0.3651484

As a summary, in addition to @Niek's answer: first transpose the data frame with t(), excluding the first column (which contains the names, is not numeric and thus cannot be used for correlation calculations), and assign these names as the new column names in the same step. Then calculate specific correlations. The solution in one piece would be:
cr2 <- setNames(as.data.frame(t(cr[, -1])), as.character(cr$User))
with(cr2, cor(Victoria, Nuria, use = "complete.obs"))
[1] 0.3651484
Or for the whole correlation matrix:
cor(cr2, use = "pairwise.complete.obs")
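The transpose-then-correlate steps can be tried on a small made-up ratings table first. In this sketch the data frame ratings and the user names A, B and C are assumptions for illustration; A and B are constructed to be perfectly linear on the movies they both rated, so their correlation is 1.

```r
# Made-up ratings, for illustration only
ratings <- data.frame(User = c("A", "B", "C"),
                      m1 = c(1, 2, NA),
                      m2 = c(2, 4, 1),
                      m3 = c(3, 6, 5),
                      m4 = c(NA, 8, 2),
                      stringsAsFactors = FALSE)

# One column per user, one row per movie
by_user <- t(ratings[, -1])
colnames(by_user) <- ratings$User

# Correlation between two users, using only movies both have rated
r_ab <- cor(by_user[, "A"], by_user[, "B"], use = "complete.obs")

# Full user-by-user matrix; each pair uses its own set of shared movies
r_all <- cor(by_user, use = "pairwise.complete.obs")
```

With pairwise deletion every entry of the matrix can be based on a different number of movies, so entries built on very few shared ratings should be read with care.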
