R: Convert categorical data to a dummy set by another variable

I have this data set. I am posting a screenshot of the real data rather than code.
Sorry for the mess, I am a newbie in R.
[image: screenshot of the data set]
I want to convert the categorical column "13 Source" into dummy variables, summarized by "HH No", so that the result looks like this:
[image: screenshot of the desired output]
I've tried to.dummy from varhandle and model.matrix, but both gave me a messy data set.
Could anybody help me with this?
Thanks a million in advance.

There are a number of ways to make dummy variables from factors - here is one way to create a summary presence table.
Assume df is your data frame. You can start with xtabs, which creates a frequency table from your 2 columns.
Comparing the counts with > 0 gives TRUE where a source occurs at least once for a household and FALSE otherwise. Adding 0 at the end then coerces TRUE to the number 1 and FALSE to the number 0.
(xtabs(~ HH_No + Source, df) > 0) + 0
Output
     Source
HH_No Deep_well Rainwater
    1         1         1
    3         1         1
    4         0         1
Data
df <- structure(list(HH_No = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3,
3, 3, 4, 4), Source = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("Deep_well",
"Rainwater"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))
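If a plain data frame with one row per household is more convenient than a table object, the presence table can be converted afterwards. The following is a minimal sketch on a smaller made-up data set (the values are illustrative, not the asker's real data); the names presence and wide are my own.

```r
# Illustrative data only: same column names as above, fewer rows
df <- data.frame(
  HH_No  = c(1, 1, 3, 3, 4),
  Source = factor(c("Rainwater", "Deep_well", "Rainwater", "Deep_well", "Rainwater"))
)

# 1 if the household uses the source at least once, 0 otherwise
presence <- (xtabs(~ HH_No + Source, data = df) > 0) + 0

# Turn the 2-d table into a data frame and keep HH_No as a column
wide <- as.data.frame.matrix(presence)
wide$HH_No <- rownames(wide)
wide
```

Household 4 only ever reports Rainwater in this toy data, so its Deep_well entry is 0.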

Related

Mosaic plot: Discrete value supplied to continuous scale

I am working on a project about the relationship between voters' perceptions and their voting behavior.
The x variable, perception (PQ8_W3), is a categorical variable with 4 values:
(1) winner
(2) loser
(3) can't say
(9) don't know
whereas the y variable (same) is a 1/0 dummy.
I know x is a factor, and I tried converting it to numeric, but I still cannot generate the mosaic plot. I also recoded 9 as 4 to make it continuous, but that did not work either.
Here is the dput output for same and PQ8_W3, respectively:
c(NA, NA, NA, NA, 1, NA, NA, 1, 1, 0, 0, 1, 1, 1, 1)
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 1L,
1L, 1L), .Label = c("1", "2", "3", "9"), class = "factor")
This is my code
library(ggplot2)
library(ggmosaic)
# this works, but I want to swap the two variables
ggplot(data = AfD) + geom_mosaic(aes(x = product(same), fill=PQ8_W3), na.rm=TRUE)
# this call fails; the console only reports "Discrete value supplied to continuous scale"
ggplot(data = AfD) + geom_mosaic(aes(x = product(PQ8_W3), fill=(same), na.rm=TRUE))
Hope someone can help.

Adding geom_line mean to reordered geom_point plot in ggplot

Trying to produce a point plot that reorders my values and also has a mean line above the values.
I can produce the plot with the mean line, or the reordered values but not both at the same time because I get the error
"geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?".
I believe I am getting the error as some of my data only has one observation but I don't understand why this only becomes an issue with the reorder data.
In the end all I want is to be able to show the means of the two different values groups for each x value.
Here is my sample code
library(ggplot2)
typ <- c("T", "N", "T", "T", "N")
samplenum <- c(7,7,6,8,8)
values <- c(1,2,1,3,2)
df = data.frame(typ, samplenum, values)
d <- ggplot(df, aes(x= reorder(samplenum, values), y= values))
d <- d + geom_point(position=position_jitter(width=0.15, height=0.05))
d <- d + aes(colour = factor(df$typ))
d <- d + stat_summary(fun.y = mean, geom="line")
d
Thank you for the help in advance.
This is what I am going for
Here are some sample plots from the steps before completion, produced from my larger data set:
With Line but Not Reordered
Reordered but No Mean Line
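As the error message suggests, you need to adjust the group aesthetic. reorder() returns a factor, so x becomes a discrete scale, and ggplot2 then groups by the combination of all discrete variables (here x and colour). Each group is left with a single summarised point, so there is nothing to connect; hence the error. Setting group explicitly makes stat_summary draw one line per typ across the x values.
You can try this
ggplot(df, aes(x = reorder(samplenum, values), y = values, colour = factor(typ))) +
  geom_jitter(width = 0.15, height = 0.05) +
  # fun.y is deprecated since ggplot2 3.3.0; use fun = mean there
  stat_summary(fun.y = mean, geom = "line", aes(group = factor(typ)))
(I altered your data slightly so it contains more observations.)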
data
df <- structure(list(typ = structure(c(2L, 1L, 2L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L), .Label = c("N", "T"), class = "factor"),
samplenum = c(7, 7, 6, 8, 8, 7, 7, 6, 8, 8, 7, 7, 6, 8, 8
), values = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 2L,
1L, 3L, 1L, 2L)), .Names = c("typ", "samplenum", "values"
), row.names = c(NA, -15L), class = "data.frame")
The resulting plot with your input data
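To see why the x scale turns discrete, it can help to inspect what reorder() returns on the question's data. A small base R check (the name x_reordered is only for illustration):

```r
samplenum <- c(7, 7, 6, 8, 8)
values <- c(1, 2, 1, 3, 2)

# reorder() converts samplenum to a factor whose levels are sorted
# by the mean of values within each level (the default FUN is mean)
x_reordered <- reorder(samplenum, values)

is.factor(x_reordered)            # TRUE: ggplot2 treats this as a discrete x
levels(x_reordered)               # level order follows the per-level means
tapply(values, samplenum, mean)   # the means used for the ordering
```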

Casting dataframe gives error R

Here is the dataframe df on which I'm trying to do a pivot using cast function
dput(df)
structure(list(Val = c(1L, 2L, 2L, 5L, 2L, 5L), `Perm 1` = structure(c(1L,
2L, 3L, 3L, 3L, 3L), .Label = c("Blue", "green", "yellow"
), class = "factor"), `Perm 2` = structure(c(1L, 2L, 2L, 3L,
3L, 3L), .Label = c("Blue", "green", "yellow"), class = "factor"),
`Perm 3` = structure(c(1L, 2L, 2L, 2L, 3L, 3L), .Label = c("Blue",
"green", "yellow"), class = "factor")), .Names = c("Val",
"Perm 1", "Perm 2", "Perm 3"), row.names = c(NA, 6L), class = "data.frame")
And expecting the data after pivot
Blue     1   1  1
green    2   4  9
yellow  14  12  7
I tried doing
cast(df, df$Val ~ df$`Perm 1`+df$`Perm 2`+df$`Perm 3`, sum, value = 'Val')
But this gives error
Error: Casting formula contains variables not found in molten data: df$Val, df$`Perm1`, df$`Perm2`
How can I do the pivot so that I get the desired output?
P.S. The data frame df has around 36 columns, but for simplicity I show only 3 of them.
Any suggestion will be appreciated.
Thank you
Domnick
It appears you want to sum Val, grouped by each Perm column in turn. Although hacky, I think this works for your problem. First we create a function that performs the summation for one column, using group_by_at() so the column name can be passed as a string. Link for more information: Group by multiple columns in dplyr, using string vector input
library(dplyr)
sum_f <- function(col, df) {
  df <- df %>%
    group_by_at(col) %>%
    summarise(Val = sum(Val)) %>%
    ungroup()
  df[, 2]  # keep only the summed column
}
We then apply it to your dataset using lapply, binding the summations together. Note that the column names in your data contain a space.
bind_cols(lapply(c('Perm 1', 'Perm 2', 'Perm 3'), sum_f, df))
This gets us the above answer.
Caveats: You need to know the names of the columns you want to sum over for this to work. Also, each column needs to have the same levels (i.e. Blue, green, yellow), because the rows are bound by position. The code will respect this ordering.
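A base R alternative, not part of the answer above, is to sum Val within the levels of each Perm column with tapply and collect the results with sapply. This is only a sketch: it rebuilds the question's data by hand, and the names perm_cols and pivot are my own. Like the dplyr version, it assumes every Perm column has the same levels.

```r
# The question's data, rebuilt by hand; check.names = FALSE keeps the spaces
df <- data.frame(
  Val      = c(1L, 2L, 2L, 5L, 2L, 5L),
  `Perm 1` = factor(c("Blue", "green", "yellow", "yellow", "yellow", "yellow")),
  `Perm 2` = factor(c("Blue", "green", "green", "yellow", "yellow", "yellow")),
  `Perm 3` = factor(c("Blue", "green", "green", "green", "yellow", "yellow")),
  check.names = FALSE
)

# Pick up every Perm column by name pattern instead of listing them
perm_cols <- grep("^Perm", names(df), value = TRUE)

# For each Perm column, sum Val within each level; sapply binds the
# three named vectors into a matrix with one row per level
pivot <- sapply(df[perm_cols], function(col) tapply(df$Val, col, sum))
pivot
```

Selecting the columns with grep means the same call should carry over to the full data frame with around 36 columns, provided they share the "Perm" prefix.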

R - Trying to avoid a loop here

First post ever here, but I've been reading a lot so thanks!
I have a huge data frame with many columns, but only 4 matter here:
dates/classes/names/grades.
For each date, I have several classes (with students), each with several people (names - always the same people in their respective classes), each one having ONE grade per date.
On the first date, I retrieve the best student per class based on his grade, using max().
However, for the next dates, I want to do the following:
If the previous best student is still in the top 3 of his class, then we consider him to still be the best one.
Else, we consider the new 1st student to be the best one.
Hence, every date depends on the previous one.
Is it possible to do this without a loop?
I can't find out how, as every iteration depends on the previous one.
This is my code below.
Apologies if it's not optimized!
Thanks a lot :)
for (i in (1:(length(horizon)-1))) #horizon is the vector of dates
{
  uni3 <- dataaf[dataaf[,1] == as.numeric(horizon[i]),] #dataaf contains all the data, we only keep the rows for the considered date i
  if (i == 1) #we take the best student per class
  {
    selecdate <- data.frame() #selecdate is the dataframe containing the best people for this date
    for (z in (1:15)) #15 classes
    {
      selecsec <- na.omit(uni3[uni3[,14] == z,]) #classes are column 14
      ligneselec <- max(selecsec[,13]) #grades are column 13
      selecsec <- data.frame(uni3[match(ligneselec,uni3[,13]),])
      selecdate <- rbind(selecdate,selecsec)
    }
  }
  else { #we keep a student if he was in the previous top 3, else we take the best one
    selecdate <- data.frame()
    for (z in (1:15))
    {
      lastsec <- na.omit(lastdate[lastdate[,14] == z,]) #last results
      #retrieving the top 3 people this date
      selecsec <- na.omit(uni3[uni3[,14] == z,])
      newligneselec <- tail(sort(selecsec[,13]),3)
      selecsec <- data.frame(selecsec[rev(match(newligneselec,selecsec[,13])),])
      if((length(match(selecsec[,3],lastsec[,3])[!is.na(match(selecsec[,3],lastsec[,3]))]) == 0))
      {
        ligneselec <- max(selecsec[,13])
        selecsec <- data.frame(uni3[match(ligneselec,uni3[,13]),])
      }
      else
      {
        selecsec <- lastsec
      }
      selecdate <- rbind(selecdate,selecsec)
    }
  }
  lastdate <- selecdate #recording the last results
}
EDIT : Here is an example.
On date 1, John and Audrey are selected in classes 1 and 2, respectively.
On date 2, John is still among the best 3, so he remains selected, while Audrey is only 4th, so Jim (ranked 1st on date 2) replaces her.
On date 3, John is still among the best 3, so he remains selected (no ties issues in the data I work on). Jim is now 4th, so Sandra takes his place.
structure(list(Dates = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("12/02", "13/02", "14/02"
), class = "factor"), Classes = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2
), Names = structure(c(6L, 3L, 9L, 7L, 1L, 8L, 4L, 10L, 5L, 2L,
6L, 3L, 9L, 7L, 1L, 8L, 4L, 10L, 5L, 2L, 6L, 3L, 9L, 7L, 1L,
8L, 4L, 10L, 5L, 2L), .Label = c("Ashley", "Audrey", "Bob", "Denis",
"Jim", "John", "Kim", "Sandra", "Terry", "Tim"), class = "factor"),
Grades = c(10, 5, 3, 2, 1, 3, 4, 5, 6, 7, 8, 2, 10, 9, 1,
7, 5, 1, 8, 2, 5, 1, 4, 8, 8, 7, 6, 5, 4, 3)), .Names = c("Dates",
"Classes", "Names", "Grades"), row.names = c(NA, -30L), class = "data.frame")
Edited to reflect clarified request in the comments.
###---------- CREATING THE DATA (may be different from what you had in mind)
# Classes and Students
Classes <- c("U.S. History", "English", "NonLinear Optimization")
Students <- c("James", "Jamie", "John", "Jim", "Jane", "Jordan", "Jose")
df.1 <- expand.grid(Classes = Classes, Students = Students, stringsAsFactors = T)
# Generate Dates
Dates.seq <- seq(as.Date("2017/2/10"), as.Date("2017/3/27"), "days")
df.2 <- merge(Dates.seq, df.1)
# Generate Grades
grading <- c(4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7)
Grades <- sample(grading, size = dim(df.2)[1], replace = T, prob = grading/sum(grading)) # smart students
df <- data.frame(df.2, Grades)
colnames(df) <- c("Dates","Classes","Students","Grades")
# Works assuming your df has the following labeled and formatted columns
str(df)
#'data.frame': 966 obs. of 4 variables:
# $ Dates : Date, format: "2017-02-10" "2017-02-11" "2017-02-12" ...
# $ Classes : Factor w/ 3 levels "U.S. History",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Students: Factor w/ 7 levels "James","Jamie",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Grades : num 2.3 3.3 2.3 3.3 2.7 4 4 1.7 2.3 4 ...
# No aggregation, just splitting by classes
df.split1 <- split(df, df[,"Classes"])
# Then splitting each of those lists by Dates
df.split2 <- lapply(df.split1, function(x) split(x, x[,"Dates"]))
# nested lapply because now we have lists within lists
top1 <- lapply(df.split2, function(i) lapply(i, function(j) j[order(-j[,"Grades"])[1], "Students"]))
top3 <- lapply(df.split2, function(i) lapply(i, function(j) j[order(-j[,"Grades"])[1:3], "Students"]))
# Easier to read
AllClasses <- levels(df[,"Classes"])
AllDates <- unique(df[,"Dates"])
# Initialize a matrix to keep track of changes in the Top1 and Top3
superstar <- matrix(NA, nrow = length(AllDates), ncol = length(AllClasses),
dimnames = list(as.character(AllDates), AllClasses))
# Looping
for(date in seq_along(AllDates)){
  for(class in AllClasses){
    if(date == 1){
      # First NewTop1 = First Top1
      superstar[date, class] <- unlist(top1[[class]][date])
    } else {
      # If superstar in date-1 is in the Top3 of date now,
      if(superstar[date-1, class] %in% as.numeric(unlist(top3[[class]][date]))){
        # still superstar
        superstar[date,class] <- superstar[date-1, class]
      } else{
        # new superstar is highest scorer of date now
        superstar[date,class] <- unlist(top1[[class]][date])
      }
    }
  }
}
# painful for me trying to figure out how to convert superstar numbers to names but this worked
superstar.char <- as.data.frame(matrix(levels(df[,"Students"])[superstar], ncol = length(AllClasses)))
dimnames(superstar.char) <- dimnames(superstar)
superstar.char # superstar with Students as characters
Let me know if you have any difficulties!
It is possible to solve anything you would otherwise solve in a loop with a recursive function (a function that calls itself). Since you are changing the behavior of the function depending on i you'll need to pass i as param into the function. You'll also need the function to be able to realize when it is done and return the result set.
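As a sketch of how the "each date depends on the previous one" rule can be written without an explicit for loop, the example below uses Reduce() with accumulate = TRUE rather than hand-written recursion. Reduce still iterates internally, so this is a change of style, not a speed-up. The data are the example from the question rebuilt as plain vectors; best_per_date and step are hypothetical helper names, and the code assumes the date labels sort in chronological order.

```r
# Example data from the question, rebuilt as plain vectors
grades <- data.frame(
  Dates   = rep(c("12/02", "13/02", "14/02"), each = 10),
  Classes = rep(rep(c(1, 2), each = 5), times = 3),
  Names   = rep(c("John", "Bob", "Terry", "Kim", "Ashley",
                  "Sandra", "Denis", "Tim", "Jim", "Audrey"), times = 3),
  Grades  = c(10, 5, 3, 2, 1, 3, 4, 5, 6, 7,
              8, 2, 10, 9, 1, 7, 5, 1, 8, 2,
              5, 1, 4, 8, 8, 7, 6, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: selected student on each date for one class
best_per_date <- function(class_df) {
  # One ranking (best grade first) per date, in sorted date order
  ranked <- lapply(split(class_df, class_df$Dates),
                   function(d) d$Names[order(-d$Grades)])
  # Keep the previous pick if still in the top 3, else take the new 1st
  step <- function(prev, ranking) {
    if (prev %in% head(ranking, 3)) prev else ranking[1]
  }
  best <- Reduce(step, ranked[-1], init = ranked[[1]][1], accumulate = TRUE)
  setNames(unlist(best), names(ranked))
}

# One column per class, one row per date
result <- sapply(split(grades, grades$Classes), best_per_date)
result
```

On this data the class 1 column stays John on all three dates, while class 2 goes Audrey, Jim, Sandra, matching the walk-through in the question.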

Dataframe manipulation: Convert certain columns of a dataframe into a list based on a key value column

I have a DF like the example created by the code below.
a = data.frame( name = c(rep("Tim",5),rep("John",3)),id = c(rep(1,5),rep(2,3)), value = 1:8)
And I want to transform it into a result that looks like this.
b = data.frame( name = c("Tim","John"), ID = c(1:2), b = NA)
b$value = list(c(1:5),c(6:8))
How would I go about doing this transformation?
For the actual data frame, I will have many columns to the left of the ID column, which I will want to perform calculations on with the columns of lists that will be created on the right side of the ID field.
For example, on the DF b above, I might want to perform a function call with "Tim" as an argument and loop through each individual element in the list = {1,2,3,4,5} and the output of that loop is another list with the same number of elements.
Try
aggregate(value~.,a, FUN=c)
#  name id         value
#1  Tim  1 1, 2, 3, 4, 5
#2 John  2       6, 7, 8
data
a <- structure(list(name = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L), .Label = c("John", "Tim"), class = "factor"), id = c(1,
1, 1, 1, 1, 2, 2, 2), value = 1:8), .Names = c("name", "id",
"value"), row.names = c(NA, -8L), class = "data.frame")
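For the follow-up step described in the question (calling a function with the name and walking through that person's values), Map() can pair each name with its list element. The sketch below builds the list column with split() instead of aggregate(), purely to keep the example self-contained; describe is a hypothetical stand-in for whatever per-person function is needed.

```r
a <- data.frame(name = c(rep("Tim", 5), rep("John", 3)),
                id = c(rep(1, 5), rep(2, 3)),
                value = 1:8,
                stringsAsFactors = FALSE)

# One row per person, with all of that person's values in a list column
b <- unique(a[c("name", "id")])
b$value <- split(a$value, a$id)[as.character(b$id)]

# Hypothetical per-person function: returns one output per input value
describe <- function(name, values) paste(name, values * 2)

# Map() calls describe(name, values) once per row of b
b$result <- Map(describe, b$name, b$value)
b$result
```

Each element of b$result has the same number of entries as the matching element of b$value.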
