Here is a simplified version of the data I am working with:
dat <- data.frame(country = c("country1", "country2", "country3", "country1", "country2"),
                  measurement = c("m1", "m1", "m1", "m2", "m2"),
                  y2015 = c(NA, 15, 19, 13, 55), y2016 = c(NA, 17, NA, 10, NA),
                  y2017 = c(14, NA, NA, 9, 45), y2018 = c(18, 22, 16, NA, 40))
I am trying to take the difference between the two non-missing values on either side of the NAs, and replace the missing values by spreading that difference evenly over the missing years.
For row 5, this would be something like c(55, 50, 45, 40).
However, it also needs to work for the rows that have more than one missing value in a sequence, like row 1 and row 3. For row 1, I'd like the difference between 14 and 18 to be extended backwards, so it should look something like c(6, 10, 14, 18). Meanwhile, for row 3, the difference between 19 and 16 should be divided across the two missing years, to look something like c(19, 18, 17, 16).
Essentially, I'm looking to create a slope for each country and measurement through the available years, and to interpolate missing values based on that.
I am trying to find a package for this, or perhaps write a loop. I have looked at the package 'spline', but it does not seem to work, since I want to run a separate linear interpolation for each country and measurement.
Any thoughts would be greatly appreciated!
Use zoo::na.spline:
library(zoo)
dat[-c(1:2)] <- t(na.spline(t(dat[-c(1:2)])))
country measurement y2015 y2016 y2017 y2018
1 country1 m1 6 10 14.00000 18
2 country2 m1 15 17 19.33333 22
3 country3 m1 19 18 17.00000 16
4 country1 m2 13 10 9.00000 10
5 country2 m2 55 50 45.00000 40
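Note that na.spline fits a spline rather than straight line segments, which is why row 2 above is filled with 19.33333 rather than 19.5. If strictly linear filling is wanted, a minimal base-R sketch could look like the following. fill_linear is a hypothetical helper name, and extending the nearest observed slope into leading or trailing gaps is an assumption about the desired behaviour, not something the question spells out.

```r
# Hypothetical helper: linear fill for one row of yearly values.
# Interior gaps are interpolated; leading/trailing gaps extend the
# slope of the two nearest observed points (an assumption).
fill_linear <- function(y) {
  x  <- seq_along(y)
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)            # not enough points to define a slope
  xo <- x[ok]; yo <- y[ok]; n <- length(xo)
  out <- approx(xo, yo, xout = x, rule = 1)$y   # interior gaps only
  lead  <- x < xo[1]
  trail <- x > xo[n]
  out[lead]  <- yo[1] + (x[lead]  - xo[1]) * (yo[2] - yo[1]) / (xo[2] - xo[1])
  out[trail] <- yo[n] + (x[trail] - xo[n]) * (yo[n] - yo[n - 1]) / (xo[n] - xo[n - 1])
  out
}

dat <- data.frame(country = c("country1", "country2", "country3", "country1", "country2"),
                  measurement = c("m1", "m1", "m1", "m2", "m2"),
                  y2015 = c(NA, 15, 19, 13, 55), y2016 = c(NA, 17, NA, 10, NA),
                  y2017 = c(14, NA, NA, 9, 45), y2018 = c(18, 22, 16, NA, 40))

# Apply row by row to the year columns only
dat[-(1:2)] <- t(apply(dat[-(1:2)], 1, fill_linear))
dat
```

Because each row is handled independently, every country/measurement combination gets its own slope.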
-------------------NEW POST:
I posted an incorrect example of my data earlier (I'm leaving it below). In reality my data has repeated "Modules" within the same column, and the previous solution doesn't work for my problem.
My example data (current dataset):
Id <- 1:4  # placeholder: Id is used below but was not defined in the original post
Year <- c("2013", "2020", "2015", "2012")
Grade <- c(28, 39, 76, 54)
Code <- c("A", "B", "C", "A")
Module1 <- c("English", "English", "Science", "English")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("History", "History", "History", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("Art", "Geography", "Math", "Geography")
Results3 <- c(89, 84, 45, 67)
Module14 <- c("Math", "Math", "Geography", "Art")
Results14 <- c(89, 24, 95, 67)
Module15 <- c("Science", "Art", "Art", "Science")
Results15 <- c(87, 24, 25, 67)
daf <- data.frame(Id, Year, Grade, Code, Module1, Results1, Module2, Results2, Module3, Results3, Module14, Results14, Module15, Results15)
My target - dataset I need to achieve:
Year <- c("2013", "2020", "2015", "2012")
Grade <- c(28, 39, 76, 54)
Code <- c("A", "B", "C", "A")
English <- c(45, 58,NA,54)
Math <- c(89, 24,45, NA)
Science <- c(87, NA, 34, 67)
Geography <- c(NA, 84, 95,67)
Art <- c(89,24,25,45)
wished_df <- data.frame(Id, Year, Grade, Code, English, Math, Science,Geography, Art)
Thanks again for any help!
-------------------------------- OLD POST:
I am trying to reshape my current data to new format.
Module1 <- c("English", "Math", "Science", "Geography")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("Math", "History", "English", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("History", "Art", "English", "Geography")
Results3 <- c(89, 84, 45, 67)
daf <- data.frame(Module1, Results1, Module2, Results2, Module3, Results3)
What I need is module names set as ‘variable names’, and module results set as ‘values for variable names’, looking like:
English1 <- c(45, 98, 45)
Math1 <- c(58, 12, NA)
Science1 <- c(34, NA, NA)
Geography1 <- c(54,NA, 67)
Art1 <- c(NA, 45, 84)
wished_df <- data.frame(English1, Math1, Science1,Geography1, Art1)
Thank you for any ideas.
1) reshape Using the data in the Note at the end, split the input column names into two groups (Module columns and Results columns), giving varying. Using that, reshape to long form, where varying= defines which columns in the input correspond to a single column in the long form and v.names= specifies the names to use for each of the two columns produced from the varying columns. reshape will give a data frame with time, Module, Results and id columns. We don't need the id column, so drop it using [-4].
Then reshape that back to the new wide form. idvar= specifies the source of the output rows and timevar= specifies the source of the output columns. Everything else is the body of the result. reshape will generate a time column which we don't need so remove it using [-1]. At the end we remove the junk part of each column name.
No packages are used.
varying <- split(names(daf), sub("\\d+$", "", names(daf)))
long <- reshape(daf, dir = "long", varying = varying, v.names = names(varying))[-4]
wide <- reshape(long, dir = "wide", idvar = "time", timevar = "Module")[-1]
names(wide) <- sub(".*[.]", "", names(wide))
giving:
> wide
English Math Science Geography History Art
1.1 45 58 34 54 NA NA
1.2 98 12 NA NA 67 45
1.3 45 NA NA 67 89 84
2) pivot_ Using the data in the Note at the end, specify that all columns are to be used and, using names_to=, specify that the column names in long form are taken from the first portion of the input column names (the ".value" part) and the index from the second portion, where the input names are split according to the names_pattern= regular expression. Then pivot to a new wide form where the column names are taken from the Module column and the values in the body of the result are taken from the Results column. The index column defines the rows and can be dropped afterwards.
library(dplyr)
library(tidyr)
daf %>%
pivot_longer(everything(), names_to = c(".value", "index"),
names_pattern = "(\\D+)(\\d+)") %>%
pivot_wider(names_from = Module, values_from = Results) %>%
select(-index)
giving:
# A tibble: 3 x 6
English Math History Art Science Geography
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 45 58 NA NA 34 54
2 98 12 67 45 NA NA
3 45 NA 89 84 NA 67
3) unlist/tapply Using the data in the Note at the end, another base solution can be fashioned by separately unlisting the Module and Results columns to get the long form and using tapply to convert to wide form. No packages are used.
is_mod <- grepl("Module", names(daf))
long <- data.frame(Module = unlist(daf[is_mod]), Results = unlist(daf[!is_mod]))
tab <- tapply(long$Results, list(sub("\\d+$", "", rownames(long)), long$Module), sum)
as.data.frame.matrix(tab)
giving:
Art English Geography History Math Science
Module1 NA 45 54 NA 58 34
Module2 45 98 NA 67 12 NA
Module3 84 45 67 89 NA NA
Note
Module1 <- c("English", "Math", "Science", "Geography")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("Math", "History", "English", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("History", "Art", "English", "Geography")
Results3 <- c(89, 84, 45, 67)
daf <- data.frame(Module1, Results1, Module2, Results2, Module3, Results3)
A data.table version:
library(data.table)
library(magrittr)
dt <- as.data.table(daf)
dt %>%
melt.data.table(measure.vars = patterns("^Module", "^Result")) %>%
dcast.data.table(variable ~ ..., value.var = "value2")
giving:
Key: <variable>
variable Art English Geography History Math Science
<fctr> <num> <num> <num> <num> <num> <num>
1: 1 NA 45 54 NA 58 34
2: 2 45 98 NA 67 12 NA
3: 3 84 45 67 89 NA NA
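The answers above work from the OLD POST data. For the NEW POST layout, where identifier columns such as Year, Grade and Code have to be kept and each input row becomes one output row, the unlist/tapply idea from 3) can be adapted roughly as follows. This is only a sketch under stated assumptions: Id is left out because the post never defines it, the Module and Results columns are assumed to appear in matching order, and when a module occurs twice in a row (Art in row 4) the first occurrence is kept.

```r
daf <- data.frame(
  Year = c("2013", "2020", "2015", "2012"),
  Grade = c(28, 39, 76, 54),
  Code = c("A", "B", "C", "A"),
  Module1 = c("English", "English", "Science", "English"),
  Results1 = c(45, 58, 34, 54),
  Module2 = c("History", "History", "History", "Art"),
  Results2 = c(12, 67, 98, 45),
  Module3 = c("Art", "Geography", "Math", "Geography"),
  Results3 = c(89, 84, 45, 67),
  Module14 = c("Math", "Math", "Geography", "Art"),
  Results14 = c(89, 24, 95, 67),
  Module15 = c("Science", "Art", "Art", "Science"),
  Results15 = c(87, 24, 25, 67),
  stringsAsFactors = FALSE
)

is_mod <- grepl("^Module", names(daf))
is_res <- grepl("^Results", names(daf))

# Long form: one line per (input row, module) pair
long <- data.frame(
  row = rep(seq_len(nrow(daf)), times = sum(is_mod)),
  Module = unlist(daf[is_mod], use.names = FALSE),
  Results = unlist(daf[is_res], use.names = FALSE)
)

# Wide form: rows are input rows, columns are module names;
# v[1] keeps the first result when a module repeats within a row
tab <- tapply(long$Results, list(long$row, long$Module), function(v) v[1])

# Re-attach the identifier columns
wide <- cbind(daf[!(is_mod | is_res)], as.data.frame.matrix(tab))
wide
```

The result also contains a History column, which the wished-for data frame omits; it can be dropped with wide$History <- NULL if not needed.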
I have a column called Report age that can range from 0 to 100.
Report age|
5
82
17
39
67
I would like to create a script that assigns a new column called Age Group
Report age|Age Group|
5 5 to 9
82 80 to 84
17 15 to 19
39 35 to 39
67 65 to 69
I know that if I have
df <- df %>%
  mutate(Age_Group = ifelse(`Report age` < 5, "Under 5", "No"))
I will get two outcomes. I want to set up many more: Under 5, 5 to 9, 10 to 14, 15 to 19, and so on until "85 years and over".
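We can use cut to create the groups. The breaks are the lower bounds of the 5-year bands, and right = FALSE closes each interval on the left, so that 5 falls in "5 to 9" rather than "Under 5".
library(dplyr)
lower <- seq(5, 80, by = 5)  # lower bounds of the bands: 5, 10, ..., 80
df %>%
  mutate(Age_Group = cut(`Report age`,
                         breaks = c(-Inf, lower, 85, Inf),
                         right = FALSE,
                         labels = c("Under 5", paste(lower, "to", lower + 4),
                                    "85 years and over")))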
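As a cross-check on the binning, the same 5-year labels can also be derived arithmetically with integer division, without cut. This is a small base-R sketch on the five sample ages from the question; the variable names are illustrative only.

```r
ages <- c(5, 82, 17, 39, 67)        # sample values from the question

# Lower bound of the 5-year band each age falls into
lower <- (ages %/% 5) * 5

age_group <- ifelse(ages < 5, "Under 5",
             ifelse(ages >= 85, "85 years and over",
                    paste(lower, "to", lower + 4)))
age_group
```

Unlike cut, this returns a character vector rather than an ordered set of factor levels, so cut remains the better fit when the groups need a fixed order for tables or plots.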
I have a dataset called CSES (Comparative Study of Electoral Systems) where each row corresponds to an individual (one interview in a public opinion survey), from many countries, in many different years.
I need to create a variable which identifies the ideology of the party each person voted for, as perceived by that same person.
However, the dataset identifies the perceived ideology of each party (like many other variables) by letters A, B, C, etc. Then, when it comes to identifying WHICH PARTY each person voted for, it uses a UNIQUE CODE NUMBER that does not correspond to these letters across different years (i.e., the same party can have a different letter in different years, and, of course, it is never the same party across different countries, since each country has its own political parties).
Fictitious data to help clarify, reproduce and create a code:
Let’s say:
country = c(1,1,1,1,2,2,2,2,3,3,3,3)
year = c (2000,2000,2004,2004, 2002,2002,2004,2008,2000,2000,2000,2000)
party_A_number = c(11,11,12,12,21,21,22,23,31,31,31,31)
party_B_number = c(12, 12, 11, 11, 22,22,21,22,32,32,32,32)
party_C_number = c(13,13,13,13,23,23,23,21,33,33,33,33)
party_voted = c(12,13,12,11,21,24,23,22,31,32,33,31)
ideology_party_A <- floor(runif (12, min=1, max=10))
ideology_party_B <- floor(runif (12, min=1, max=10))
ideology_party_C <- floor(runif (12, min=1, max=10))
Let’s call the variable I want to create “ideology_voted”:
I need something like:
IF party_A_number == party_voted, THEN ideology_voted = ideology_party_A
IF party_B_number == party_voted, THEN ideology_voted = ideology_party_B
IF party_C_number == party_voted, THEN ideology_voted = ideology_party_C
The real dataset has 9 letters for (up to) 9 main parties in each country, and dozens of countries and election-years. Therefore, it would be great to have code that iterates through letters A-I instead of "if voted party A, then ...; if voted party B, then ...".
Nevertheless, I am having trouble even when I try longer, repetitive code (one transformation for each party letter, which would give me 8 lines of code).
library(tidyverse)
df <- tibble(
country = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
year = c(2000, 2000, 2004, 2004, 2002, 2002, 2004, 2008, 2000, 2000, 2000, 2000),
party_A_number = c(11, 11, 12, 12, 21, 21, 22, 23, 31, 31, 31, 31),
party_B_number = c(12, 12, 11, 11, 22, 22, 21, 22, 32, 32, 32, 32),
party_C_number = c(13, 13, 13, 13, 23, 23, 23, 21, 33, 33, 33, 33),
party_voted = c(12, 13, 12, 11, 21, 24, 23, 22, 31, 32, 33, 31),
ideology_party_A = floor(runif (12, min = 1, max = 10)),
ideology_party_B = floor(runif (12, min = 1, max = 10)),
ideology_party_C = floor(runif (12, min = 1, max = 10))
)
> df
# A tibble: 12 x 9
country year party_A_number party_B_number party_C_number party_voted ideology_party_A ideology_party_B
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2000 11 12 13 12 9 3
2 1 2000 11 12 13 13 2 6
3 1 2004 12 11 13 12 3 8
4 1 2004 12 11 13 11 7 8
5 2 2002 21 22 23 21 2 7
6 2 2002 21 22 23 24 8 2
7 2 2004 22 21 23 23 1 7
8 2 2008 23 22 21 22 7 7
9 3 2000 31 32 33 31 4 3
10 3 2000 31 32 33 32 7 5
11 3 2000 31 32 33 33 1 6
12 3 2000 31 32 33 31 2 1
# ... with 1 more variable: ideology_party_C <dbl>
It seems you're after conditioning using case_when:
ideology_voted <- df %>% transmute(
ideology_voted = case_when(
party_A_number == party_voted ~ ideology_party_A,
party_B_number == party_voted ~ ideology_party_B,
party_C_number == party_voted ~ ideology_party_C,
TRUE ~ party_voted
)
)
> ideology_voted
# A tibble: 12 x 1
ideology_voted
<dbl>
1 3
2 7
3 3
4 8
5 2
6 24
7 8
8 7
9 4
10 5
11 6
12 2
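Note that case_when checks the conditions in order, so for each row the first condition that is TRUE determines the result (if it happens that more than one is actually true, say). Rows where no party matches fall through to TRUE ~ party_voted, which is why row 6 shows 24; replace party_voted with NA_real_ there to get a missing value instead.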
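Since the question also asks for a way to iterate over the party letters rather than writing one condition per letter, here is a minimal base-R sketch of that idea. The ideology scores are fixed, made-up values (the original uses runif), letters_used and matched are illustrative names, and unmatched voters are assumed to get NA rather than the party code.

```r
# Toy data: three respondents, made-up ideology scores
df <- data.frame(
  party_A_number   = c(11, 12, 21),
  party_B_number   = c(12, 11, 22),
  party_C_number   = c(13, 13, 23),
  party_voted      = c(12, 12, 24),
  ideology_party_A = c(9, 3, 8),
  ideology_party_B = c(3, 8, 2),
  ideology_party_C = c(5, 4, 6)
)

letters_used <- c("A", "B", "C")    # for the real data: LETTERS[1:9]

df$ideology_voted <- NA_real_       # default when no party matches
matched <- rep(FALSE, nrow(df))

for (l in letters_used) {
  num  <- df[[paste0("party_", l, "_number")]]
  ideo <- df[[paste0("ideology_party_", l)]]
  # first matching letter wins; %in% TRUE turns NA comparisons into FALSE
  hit <- !matched & ((num == df$party_voted) %in% TRUE)
  df$ideology_voted[hit] <- ideo[hit]
  matched <- matched | hit
}

df$ideology_voted
```

The loop builds the column names from the letter, so supporting more parties only requires extending letters_used.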
I know that this question has been answered a few times here and here. The key point seems to be to include the argument label in aes. But, for me ggplot does not accept label as an argument in aes. I tried using the generic function labels in aes as below, but that didn't work to create labels for points, though I am able to generate a graph:
launch_curve <- ggplot(data = saltsnck_2002_plot_t,aes(x=weeks,y=markets, labels(c(1,2,3,4,5,6,7,8,9,10,11,12))))+
geom_line()+geom_point()+
scale_x_continuous(breaks = seq(0,12,by=1))+
scale_y_continuous(limits=c(0,50), breaks = seq(0,50,by=5))+
xlab("Weeks since launch")+ ylab("No. of markets")+
ggtitle(paste0(marker1,marker2))+
theme(plot.title = element_text(size = 10))
print(launch_curve)
Does anyone know a way around this? I am using R version 3.4.3.
Edited to include sample data:
The data that I use to plot is in the dataframe saltsnck_2002_plot_t. (12 rows by 94 cols). A sample is given below:
>saltsnck_2002_plot_t
11410008398 11600028960 11819570760 11819570761 12325461033 12325461035 12325461037
Week1 3 2 2 1 2 2 1
Week2 6 16 10 1 3 2 2
Week3 11 41 13 10 3 3 2
Week4 15 46 14 14 3 4 4
Week5 15 48 15 14 3 4 4
Week6 27 48 15 15 3 4 4
Week7 31 50 15 15 3 4 5
Week8 33 50 16 16 5 5 6
Week9 34 50 18 16 5 5 6
Week10 34 50 21 19 5 5 6
Week11 34 50 23 21 5 5 6
Week12 34 50 24 23 5 5 6
I am actually plotting graphs in a loop by moving through the columns of the dataframe. This dataframe is the result of a transpose operation, hence the weird row and column names. The graph for the first column looks like the one below. And a correction from my earlier post, I need to capture as data labels the values in the column and not c(1,2,3,4,5,6,7,8,9,10,11,12).
Use geom_text
library(ggplot2)
ggplot(data = df, aes(x = weeks_num, y = markets)) +
  # geom_text adds the point labels; use label = markets to print the plotted values instead
  geom_line() + geom_point() + geom_text(aes(label = weeks), hjust = 0.5, vjust = -1) +
  scale_y_continuous(limits = c(0, 50), breaks = seq(0, 50, by = 5)) +
  scale_x_continuous(breaks = seq(1, 12, by = 1)) +
  xlab("Weeks since launch") + ylab("No. of markets") +
  ggtitle("markets") +  # placeholder title; paste0(markets) fails because markets only exists inside df
  theme(plot.title = element_text(size = 10))
Data
df <- structure(list(weeks_num = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12), weeks = structure(1:12, .Label = c("week1", "week2", "week3",
"week4", "week5", "week6", "week7", "week8", "week9", "week10",
"week11", "week12"), class = c("ordered", "factor")), markets = c(3,
6, 11, 15, 27, 31, 33, 34, 34, 34, 34, 34)), .Names = c("weeks_num",
"weeks", "markets"), row.names = c(NA, -12L), class = "data.frame")
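The question's data is a transposed table with Week1 to Week12 as row names and one product per column, whereas the plot above expects a data frame with weeks_num and markets columns. A minimal base-R sketch of that conversion for a single column is shown below; the matrix m and the name plot_df are illustrative, and only the first column of the sample from the question is reproduced.

```r
# First column of the sample shown in the question, with its row names
m <- matrix(c(3, 6, 11, 15, 15, 27, 31, 33, 34, 34, 34, 34),
            ncol = 1,
            dimnames = list(paste0("Week", 1:12), "11410008398"))

j <- 1  # column to plot; in a loop this would run over seq_len(ncol(m))

plot_df <- data.frame(
  weeks_num = as.integer(sub("^Week", "", rownames(m))),  # "Week7" -> 7
  markets   = unname(m[, j])
)
plot_df
```

Inside a loop over columns, colnames(m)[j] could then serve as the plot title.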