I have a data set with a variable 'education' which is coded differently in each of the three countries included, for example:
Code  Country 1     Country 2     Country 3
1     No education  No education  No education
2     Primary       Primary       Islamic education
3     Secondary     Secondary     Primary
4     NA            NA            Secondary
I need to apply factor levels, which are different for each country.
Below is my attempt, but it doesn't appear to work:
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4)
)
df$Education <-
if(df$Country == "Country1") {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Primary", "Secondary", "NA"))
} else if (df$Country == "Country2") {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Primary", "Secondary", "NA"))
} else {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Islamic education", "Primary", "Secondary")
)
}
Thanks
Perhaps this helps? It takes the table that maps each country's education codes to education categories and converts it to long format.
A left join then attaches the matching category to each row of the two-column data frame of countries and education codes.
You can use the resulting column of education types as a string, or recode the codes so that they are consistent across countries.
library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4))
df_ed <- structure(list(Code = 1:4, Country.1 = c("No education", "Primary",
"Secondary", NA), Country.2 = c("No education", "Primary", "Secondary",
NA), Country.3 = c("No education", "Islamic education", "Primary",
"Secondary")), class = "data.frame", row.names = c(NA, -4L))
df_levels <-
df_ed %>%
pivot_longer(-Code) %>%
mutate(name = str_replace(name, "\\.", " "))
df1 <-
df %>%
left_join(df_levels, by = c("Country" = "name", "Education_1" = "Code"))
head(df1)
#> Country Education_1 value
#> 1 Country 1 3 Secondary
#> 2 Country 2 4 <NA>
#> 3 Country 3 1 No education
#> 4 Country 1 2 Primary
#> 5 Country 3 3 Primary
#> 6 Country 2 4 <NA>
Created on 2021-09-22 by the reprex package (v2.0.0)
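Since the question asks for a factor, here is a minimal base R sketch of the same lookup idea that ends in a factor column. It also sidesteps the problem in the original attempt: if() expects a single TRUE or FALSE, so it cannot choose labels row by row. The lookup table and the unified level order are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical long-format lookup: one row per (country, code) pair
lookup <- data.frame(
  Country = rep(c("Country 1", "Country 2", "Country 3"), each = 4),
  Code    = rep(1:4, times = 3),
  Label   = c("No education", "Primary", "Secondary", NA,
              "No education", "Primary", "Secondary", NA,
              "No education", "Islamic education", "Primary", "Secondary"),
  stringsAsFactors = FALSE
)

set.seed(1)
df <- data.frame(
  Country     = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
  Education_1 = sample(1:4, 100, replace = TRUE)
)

# Find, for every row, the lookup row with the same country and code
idx <- match(paste(df$Country, df$Education_1),
             paste(lookup$Country, lookup$Code))

# One factor with a single set of levels shared by all countries
# (the level order is an assumption; adjust as needed)
df$Education <- factor(
  lookup$Label[idx],
  levels = c("No education", "Islamic education", "Primary", "Secondary")
)

table(df$Country, df$Education, useNA = "ifany")
```

Code 4 in Country 1 and Country 2 ends up as a missing value rather than a level called "NA", which is usually what you want for later modelling.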
I can separate (using ", ") a column into multiple columns.
The idea is to reverse the order of words (separated by ", ") and then separate them into multiple columns. Example of reversing - "CA, SF" becomes "SF, CA"
Below is an example
library(tidyverse)
# sample example
tbl <- tibble(
letter = c("US, CA, SF","NYC", "Florida, Miami")
)
# desired result
tbl_desired <- tibble(
country = c("US", NA, NA),
state = c("CA", NA, "Florida"),
city = c("SF", "NYC", "Miami")
)
# please edit it to get the desired result
tbl %>%
# please add line to reverse the string
mutate() %>%
separate(letter, into = c("country", "state", "city"), sep = ", ")
separate has a fill argument (by default it is "warn"), which can be changed to either "right" or "left". Here, the missing pieces should be filled from the "left":
library(tidyr)
separate(tbl, letter, into = c("country", "state", "city"),
sep = ", ", fill = "left")
Output:
# A tibble: 3 × 3
country state city
<chr> <chr> <chr>
1 US CA SF
2 <NA> <NA> NYC
3 <NA> Florida Miami
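The question also asks how to reverse the order of the comma-separated parts. With fill = "left" that step is not needed, but for completeness here is a minimal base R sketch of the reversal, assuming the parts are always separated by ", ". The helper name reverse_parts is hypothetical.

```r
# Hypothetical helper: reverse the ", "-separated parts of each string
reverse_parts <- function(x, sep = ", ") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) paste(rev(p), collapse = sep), character(1))
}

letter <- c("US, CA, SF", "NYC", "Florida, Miami")
reversed <- reverse_parts(letter)
reversed
```

After reversing, separating into c("city", "state", "country") with the default right-hand padding would give the same columns in the opposite order.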
If I have this tibble:
tibble(
period = c("2010END", "2011END",
"2010Q1","2010Q2","2011END"),
date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
website = c(
"google",
"google",
"facebook",
"facebook",
"youtube"
),
method = c("website",
"phone",
"website",
"laptop",
"phone"),
values = c(1, NA, 1, 2, 3))
And then I have this data frame, which tells you which quantiles to create, along with the rank to assign to each quantile:
tibble(
method = c(
"phone",
"phone",
"phone",
"website",
"website",
"website",
"laptop",
"laptop",
"laptop"
),
rank = c(3,2,1,3,2,1,3,2,1),
tile_condition = c("lowest 25%", "25 to 50%", "more than 50%",
"highest 25%", "25 to 50%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")
)
How can I use a case_when statement to create a ranking column based on the quartiles of the values column in the first data frame?
I'm trying to apply the quantile definitions from the second data frame to create a ranking column in the original data frame, and I'm stuck on how to use case_when for it.
I would do something like this:
set.seed(124)
left_join(
df1[sample(1:5,1000, replace=T),] %>%
mutate(values=sample(c(df1$values,1:30),1000, replace=T)) %>%
group_by(method) %>%
mutate(q=as.double(cut(values,quantile(values,probs=seq(0,1,0.25), na.rm=T), labels=c(1:4), include.lowest=T))) %>%
ungroup(),
df2 %>% mutate(q = list(1,2,c(3,4),4,c(2,3),1,1,2,c(3,4))) %>% unnest(q),
by=c("method", "q")
) %>% select(-q)
Output:
# A tibble: 1,000 × 7
period date website method values rank tile_condition
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 2010END 31-12-2010 google website 7 2 25 to 75%
2 2011END 31-12-2011 google phone 18 1 more than 50%
3 2010Q1 30-04-2010 facebook website 21 2 25 to 75%
4 2011END 30-09-2010 youtube phone 15 1 more than 50%
5 2011END 30-09-2010 youtube phone 26 1 more than 50%
6 2011END 31-12-2011 google phone 3 3 lowest 25%
7 2010END 31-12-2010 google website 1 1 less than 25%
8 2010Q1 30-04-2010 facebook website 2 1 less than 25%
9 2010Q2 31-07-2010 facebook laptop 14 2 25 to 50%
10 2010Q2 31-07-2010 facebook laptop 16 1 more than 50%
# … with 990 more rows
Notice that I expanded your input to 1000 rows with random new values for the purposes of illustration. Also, notice that I fixed df2 so that the method website covers the full range of values. In your example the 50% to 75% quartile is missing.
Adjusted df2 input:
structure(list(method = c("phone", "phone", "phone", "website",
"website", "website", "laptop", "laptop", "laptop"), rank = c(3,
2, 1, 3, 2, 1, 3, 2, 1), tile_condition = c("lowest 25%", "25 to 50%",
"more than 50%", "highest 25%", "25 to 75%", "less than 25%",
"lowest 25%", "25 to 50%", "more than 50%")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))
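The quartile step in the code above (cut() applied to quantile() breaks) can be seen in isolation in this small sketch. The input vector is made up for illustration and is not taken from the question.

```r
# Toy input, assumed for illustration only
values <- 1:8

# Quartile boundaries: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(values, probs = seq(0, 1, 0.25), na.rm = TRUE)

# Assign each value to quartile 1..4; include.lowest keeps the minimum in bin 1
quartile <- cut(values, breaks = breaks, labels = 1:4, include.lowest = TRUE)
as.integer(quartile)
```

One caveat: cut() stops with an error when two of the breaks coincide, which can happen in small groups with many repeated values, so the breaks may need de-duplicating first.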
If I've understood your question correctly, you first have to create a table to compare against, as:
df_quants <-
df1 %>%
drop_na(values) %>%
group_by(method) %>%
summarize(quant25 = quantile(values, probs = 0.25),
quant50 = quantile(values, probs = 0.5),
quant75 = quantile(values, probs = 0.75),
quant100 = quantile(values, probs = 1))
Then, using a join and a case_when statement, you would arrive at:
df2 %>%
left_join(df_quants, by = 'method') %>%
mutate(tiles =
case_when(rank < quant25 ~ 'lowest 25%',
rank < quant50 ~ '25 to 50%',
rank < quant75 ~ 'more than 50%',
rank >= quant75 ~ 'highest 25%'))
Here's a quick version. It doesn't get the exact labels you want. To do that you'd have to parse your tile_condition column, which is a little tricky.
library(santoku)
df |>
group_by(method) |>
mutate(
quantile = chop_quantiles(values, c(0.25, 0.50),
labels = c("Lowest 25%", "25 to 50%", "Above 50%"), extend = TRUE)
)
I have data with 38 columns in total.
Sample data:
df <- structure(
list(
Christensenellaceae = c(
0.010484508,
0.008641566,
0.010017172,
0.010741488,
0.1,
0.2,
0.3,
0.4,
0.7,
0.8,
0.9,
0.1,
0.3,
0.45,
0.5,
0.55
),
Date=c(27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28),
Treatment = c(
"Treatment 1",
"Treatment 1",
"Treatment 1",
"Treatment 1",
"Treatment 2",
"Treatment 2",
"Treatment 2",
"Treatment 2",
"Treatment 1",
"Treatment 1",
"Treatment 1",
"Treatment 1",
"Treatment 2",
"Treatment 2",
"Treatment 2",
"Treatment 2"
)
),class = "data.frame",
row.names = c(NA,-16L)
)
What I wish to do is create a Kendall correlation matrix (the data doesn't behave linearly) between the treatment types (10 in total, but 2 in the example) for every column except Treatment and Date. That makes 36 correlation matrices in total, each of size 10 x 10 (2 x 2 in the example).
This is my code:
res2 <- cor(as.matrix(data),method ="kendall")
But I get the error:
Error in cor(data, method = "kendall") : 'x' must be numeric
Is there any way to solve this? Thank you :)
You can do that with a tidyverse approach: first reshape the data so that each treatment becomes its own column, then use correlate to calculate the pairwise correlation for every combination of columns.
library(corrr)
library(tidyverse)
df |>
# Transform data into wide format
pivot_wider(id_cols = Date,
names_from = Treatment,
values_from = -starts_with(c("Treatment", "Date"))) |>
# Unnest lists inside each column
unnest(cols = starts_with("Treatment")) |>
# Remove Date from the columns
select(-Date) |>
# Correlate all columns using kendall
correlate(method = "kendall")
# A tibble: 2 x 3
# term `Treatment 1` `Treatment 2`
# <chr> <dbl> <dbl>
#1 Treatment 1 NA 0.546
#2 Treatment 2 0.546 NA
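The original error arises because as.matrix() on a data frame that contains a character column (Treatment) produces a character matrix, which cor() rejects. A base R sketch of the per-column idea follows, using made-up toy data with two measurement columns. It assumes every treatment has the same number of samples and that samples are paired by their position within each treatment; the column names taxon_a and taxon_b are hypothetical.

```r
# Hypothetical toy data: two measurement columns, two treatments, four samples each
toy <- data.frame(
  taxon_a   = c(0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1),
  taxon_b   = c(1, 3, 2, 4, 2, 1, 4, 3),
  Treatment = rep(c("Treatment 1", "Treatment 2"), each = 4)
)

# Every column except the grouping columns is a measurement
measure_cols <- setdiff(names(toy), c("Treatment", "Date"))

# One treatment-by-treatment Kendall matrix per measurement column
kendall_by_column <- lapply(measure_cols, function(col) {
  # Numeric matrix with one column per treatment (equal group sizes assumed)
  wide <- do.call(cbind, split(toy[[col]], toy$Treatment))
  cor(wide, method = "kendall")
})
names(kendall_by_column) <- measure_cols

kendall_by_column$taxon_a
```

With 36 measurement columns and 10 treatments, the same loop would return a list of 36 matrices of size 10 x 10.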
I want to create a funnelarea chart using plotly, this is the example from plotly:
fig <- plot_ly(
type = "funnelarea",
values = c(5, 4, 3, 2, 1),
text = c("The 1st","The 2nd", "The 3rd", "The 4th", "The 5th"),
marker = list(colors = c("deepskyblue", "lightsalmon", "tan", "teal", "silver"),
line = list(color = c("wheat", "wheat", "blue", "wheat", "wheat"), width = c(0, 1, 5,
0, 4))),
textfont = list(family = "Old Standard TT, serif", size = 13, color = "black"),
opacity = 0.65)
fig
I would like to fill this chart from a data frame, using categories from my data frame columns instead of the hard-coded text and values, but I can't find a way to do it.
Example of my dataframe
funnel_stage size purchaser_payment
1. Available for Sale 10 10000
1. Available for Sale 15 15000
2. Pending on Sale 8 8000
2. Pending on Sale 9 9000
3. Already Sold 1 1000
3. Already Sold 45 45000
3. Already Sold 12 12000
I would like the funnel to be filled by counting how many times each value of the first column occurs. It would look like:
It's probably easiest if you first bring your data into a shape with one row per category in the correct order, see df2 below.
library("dplyr")
library("plotly")
df <- structure(
list(funnel_stage = c("Available for Sale", "Available for Sale",
"Pending on Sale", "Pending on Sale", "Already Sold",
"Already Sold", "Already Sold"),
size = c(10L, 15L, 8L, 9L, 1L, 45L, 12L),
purchaser_payment = c(10000L, 15000L, 8000L, 9000L, 1000L, 45000L, 12000L)),
class = "data.frame", row.names = c(NA, -7L))
df$funnel_stage <- factor(df$funnel_stage, levels = c("Available for Sale",
"Pending on Sale",
"Already Sold"))
df2 <- df %>%
group_by(funnel_stage) %>%
count()
df2
#> # A tibble: 3 x 2
#> # Groups: funnel_stage [3]
#> funnel_stage n
#> <fct> <int>
#> 1 Available for Sale 2
#> 2 Pending on Sale 2
#> 3 Already Sold 3
plot_ly() %>%
add_trace(
type = "funnelarea",
values = df2$n,
text = df2$funnel_stage)
packageVersion("dplyr")
#> [1] '0.8.5'
packageVersion("plotly")
#> [1] '4.9.2.1'
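If you would rather not depend on dplyr for the counting step, a base R sketch using table() gives the same counts in factor-level order. The vector below repeats the funnel_stage column from the answer; the resulting counts and names could then be passed as values and text in the same way as df2$n and df2$funnel_stage.

```r
# Stage labels in funnel order (same values as the funnel_stage column above)
stage_levels <- c("Available for Sale", "Pending on Sale", "Already Sold")
funnel_stage <- factor(
  c("Available for Sale", "Available for Sale",
    "Pending on Sale", "Pending on Sale",
    "Already Sold", "Already Sold", "Already Sold"),
  levels = stage_levels
)

# table() counts occurrences and keeps the factor level order
counts <- table(funnel_stage)
stage_values <- as.vector(counts)  # candidate for the values argument
stage_labels <- names(counts)      # candidate for the text argument
```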
When I try to run the following code I get an error:
value <- as.matrix(wsu.wide[, c(4, 3, 2)])
Error in `[.data.frame`(wsu.wide, , c(4, 3, 2)) : undefined columns selected
How do I get this line to work? It's part of dcasting my data.
This is the full code:
library(readxl)
library(reshape2)
Store_and_Regional_Sales_Database <- read_excel("~/Downloads/Data_Files/Store and Regional Sales Database.xlsx", skip = 2)
store <- Store_and_Regional_Sales_Database
freq <- table(store$`Sales Region`)
freq
rel.freq <- freq / nrow(store)
rel.freq
rel.freq.scaled <- rel.freq * 100
rel.freq.scaled
labs <- paste(names(rel.freq.scaled), "\n", "(", rel.freq.scaled, "%", ")", sep = "")
pie(rel.freq.scaled, labels = labs, main = "Pie Chart of Sales Region")
monitor <- store[which(store$`Item Description` == '24" Monitor'),]
wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")])
wsu.wide <- dcast(wsu, "Store No." ~ "Week Ending", value.var = "Units Sold")
value <- as.matrix(wsu.wide[, c(4, 3, 2)])
Thanks.
Edit:
This is my table called "monitor":
When I then run wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")]) I create another data frame with only the variables "Week Ending", "Store No." and "Units Sold".
However, when I run the wsu.wide code, the output I get is only this:
Why do I only get this small table when I'm asking to dcast my data?
After this I don't get what is wrong.
The problem is at the line:
wsu.wide <- dcast(wsu, "Store No." ~ "Week Ending", value.var="Units Sold")
Instead of double quotation marks (") you should use backticks (`) around the variable names in the formula:
wsu.wide <- dcast(wsu, `Store No.` ~ `Week Ending`, value.var = "Units Sold")
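The difference can be checked without any data, using only base R: a quoted name inside a formula is a character constant, while a backticked name is a variable reference. This is a small illustrative sketch, not part of the original code.

```r
# Quoted names are plain strings, so the formula contains no variables
quoted_vars <- all.vars("Store No." ~ "Week Ending")

# Backticked names are symbols, so both columns are picked up
backtick_vars <- all.vars(`Store No.` ~ `Week Ending`)

quoted_vars
backtick_vars
```

This is consistent with the tiny table in the question: with the quoted version there are no columns to cast on, so the wide result never gets a fourth column to select.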
To avoid this kind of problem, it is better not to use spaces in R object names. For example, rename the Sales Region variable to sales_region, using an underscore. See e.g. Google's R Style Guide.
Please see the code below. I used a simulation of your data, as extracting it from the picture is quite cumbersome:
library(readxl)
library(reshape2)
#simulation
n <- 4
Store_and_Regional_Sales_Database <- data.frame(
a = seq_along(LETTERS[1:n]),
sr = LETTERS[1:n],
sr2 = '24" Monitor',
sr3 = 1:4,
sr4 = 2:5,
sr5 = 3:6)
names(Store_and_Regional_Sales_Database)[2:6] <- c(
"Sales Region", "Item Description",
"Week Ending", "Store No.", "Units Sold")
# algorithm
store <- Store_and_Regional_Sales_Database
freq <- table(store$`Sales Region`)
freq
rel.freq <- freq/nrow(store)
rel.freq
rel.freq.scaled <- rel.freq * 100
rel.freq.scaled
labs <- paste(names(rel.freq.scaled), "\n", "(", rel.freq.scaled, "%", ")", sep = "")
pie(rel.freq.scaled, labels = labs, main = "Pie Chart of Sales Region")
monitor <- store[which(store$`Item Description` == '24" Monitor'),]
wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")])
wsu.wide <- dcast(wsu, `Store No.` ~ `Week Ending`, value.var = "Units Sold")
value <- as.matrix(wsu.wide[ ,c(4,3,2)])
Output:
3 2 1
[1,] NA NA 3
[2,] NA 4 NA
[3,] 5 NA NA
[4,] NA NA NA