I have paired data, shown below, and I want to compute the expected value of the within-pair difference in the column called value. In every pair, one sibling has the disease and the other does not, as you can see from the data. In other words, I want the expected difference between one sibling's value and that of his/her sibling.
The variables in the data are:
id = individual ID
familyID = family ID, identifying which individuals are siblings
status = 1 means disease and status = 0 means no-disease
Any guidance is appreciated.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
I'm not certain if this is what you are looking for, but I used pivot_wider from tidyr to spread the values into two columns: those with status 0 and those with status 1. Then I used mutate to take the difference between the two columns, and plotted the newly created difference against familyID with ggplot. Note that I removed the id column for the pivot_wider to work.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
library(dplyr)
library(tidyr)
library(ggplot2)
d %>%
  select(-id) %>%
  pivot_wider(values_from = value, names_from = status) %>%
  mutate(Diff = `0` - `1`) %>%
  ggplot() +
  aes(as.character(familyID), Diff) +
  geom_point()
You can group by familyID, then use summarize() from the dplyr package to find the differences.
Also note the conversion of id, familyID, and status to factors, which may make life easier so they aren't confused with being integers.
library(dplyr)
library(forcats)
library(ggplot2)
d <- structure(list(id = as.factor(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
familyID = as.factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10)),
status = as.factor(c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1)),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
diffs <- group_by(d, familyID) %>%
summarize(., diff = (value[status == 0] - value[status == 1]))
Reordering the families by difference can help you get a sense of the distribution of differences:
diffs$familyID <- fct_reorder(diffs$familyID, diffs$diff, .desc = TRUE)
ggplot(diffs, aes(x = familyID, y = diff)) +
geom_bar(stat="identity")
If you really have a lot of families you may want to display a summary of the differences.
One option is with a histogram (modifying binwidth can control how fine the bins are):
ggplot(diffs, aes(x = diff)) +
geom_histogram(binwidth = 3)
Similar to a histogram is a density plot:
ggplot(diffs, aes(x = diff)) +
geom_density()
Finally, a boxplot is also a familiar summary. Boxplots are mostly meant for comparing multiple groups, but they work okay with just one. I've added the individual points using the geom_jitter() function.
ggplot(diffs, aes(y = diff)) + #If using multiple groups add x=group inside the aes() function.
geom_boxplot() +
geom_jitter(aes(x = 0))
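To get a single number for the expected difference, one option is to average the per-family differences. The following is a minimal base R sketch, assuming every family has exactly one sibling with status 0 and one with status 1. The names unaffected, affected and pair_diff are made up for illustration, and the one-sample t.test() call is an extra that the question did not ask for; it only puts a confidence interval around the mean.

```r
# Same values as in the question, rebuilt as a plain data frame
d <- data.frame(
  familyID = rep(1:10, each = 2),
  status   = c(0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
  value    = c(29, 26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21,
               29, 24, 26, 29, 15.2, 11, 35, 15.4, 16, 13.4)
)

# Split by status and sort by family so that row i in both refers to the same family
unaffected <- d[d$status == 0, ]
affected   <- d[d$status == 1, ]
unaffected <- unaffected[order(unaffected$familyID), ]
affected   <- affected[order(affected$familyID), ]
stopifnot(identical(unaffected$familyID, affected$familyID))

# Within-pair difference (status 0 minus status 1) and its sample mean
pair_diff <- unaffected$value - affected$value
mean_diff <- mean(pair_diff)
mean_diff  # 1.67 for these data

# Optional: confidence interval for the mean difference
tt <- t.test(pair_diff)
tt$conf.int
```

The sign convention (status 0 minus status 1) matches the two answers above; swap the operands if you want the difference the other way round.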
I am working on a project that requires me to one-hot encode a single variable and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3 then the dummy for 3 would = 1. None of the other columns should change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
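If you would rather not load an extra package, the same idea can be sketched in base R. This is only an illustration: the small data frame below is a cut-down stand-in for the one in the question (five rows, two columns), and the column names Ratings_1, Ratings_2, Ratings_3 are my own choice, not something plm produces.

```r
# Cut-down stand-in for the question's data (first five IDs and Ratings)
data <- data.frame(
  ID      = c(284921427, 284926400, 284946595, 285755462, 285831220),
  Ratings = c(3, 3, 3, 3, 2)
)

# One 0/1 column per rating level; listing 1:3 explicitly also creates
# a column for a level that does not occur in the sample
for (lvl in 1:3) {
  data[[paste0("Ratings_", lvl)]] <- as.integer(data$Ratings == lvl)
}

data
```

Listing the levels explicitly is a design choice: a column is created for rating 1 even though no row in this sample has it, which keeps the column layout stable across subsets of the data.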
I am looking at the mediation of AUDITCEN --> INTERN through W1_cesd.
The relationship between AUDITCEN and INTERN is quadratic, but the relationship between AUDITCEN and W1_cesd is linear. I think this is causing me issues.
I am running:
dyad_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
INTERN <- c(4, 3, 4, 2, 2, 6, 8, 6, 9, 9)
AUDITCEN <- c(5.9, -6.1, -9.1, -5.1, -7.1, -6.1, 0.9, -2.1, -7.1, 1.9)
W1_cesd <- c(25, 8, 5, 0, 5, 17, 10, 5, 5, 7)
GENDERKID<- c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0)
C_AGE_DI <- c(0, 0, 1, 1, 0, 0, 0, 0, 0, 1)
RACE_W <- c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
RACE_O <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
cesd <- data.frame(dyad_id, INTERN, AUDITCEN, W1_cesd, GENDERKID, C_AGE_DI, RACE_W, RACE_O)
library(mediation)
med.fit <-glm(W1_cesd ~ AUDITCEN + GENDERKID + C_AGE_DI + RACE_W + RACE_O, data=cesd )
out.fit <-glm(INTERN ~ W1_cesd+ poly(AUDITCEN, 2) + GENDERKID + C_AGE_DI + RACE_W + RACE_O, data=cesd )
results <-mediate(med.fit, out.fit, sims = 1000, boot = TRUE, treat = "AUDITCEN", control.value=-10, treat.value=0, mediator = "W1_cesd")
The mediate() call that creates "results" produces the following error:
Error in `[.data.frame`(y.data, , treat) : undefined columns selected
My treatment variable does exist and the 2 models look fine. What is going wrong? Am I doing something wrong when I specify the quadratic association for my exp-out relationship?
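One thing worth checking, offered as a guess rather than a confirmed diagnosis: the error is raised by `[.data.frame`(y.data, , treat), which suggests that a column literally named AUDITCEN is being looked up in the data of the outcome model. With poly(AUDITCEN, 2) the model frame stores a single matrix column named "poly(AUDITCEN, 2)", so no column called AUDITCEN exists there. The base R sketch below only demonstrates that naming difference, using shortened formulas without the covariates; fit_poly and fit_raw are illustrative names, and whether the raw-term version then runs through mediate() is something I have not verified.

```r
# Variables copied from the question (covariates left out to keep this short)
INTERN   <- c(4, 3, 4, 2, 2, 6, 8, 6, 9, 9)
AUDITCEN <- c(5.9, -6.1, -9.1, -5.1, -7.1, -6.1, 0.9, -2.1, -7.1, 1.9)
W1_cesd  <- c(25, 8, 5, 0, 5, 17, 10, 5, 5, 7)
cesd <- data.frame(INTERN, AUDITCEN, W1_cesd)

# Outcome model as in the question: quadratic term via poly()
fit_poly <- glm(INTERN ~ W1_cesd + poly(AUDITCEN, 2), data = cesd)
names(model.frame(fit_poly))
# the treatment only appears as "poly(AUDITCEN, 2)"

# Alternative specification: raw linear term plus I() for the square
fit_raw <- glm(INTERN ~ W1_cesd + AUDITCEN + I(AUDITCEN^2), data = cesd)
names(model.frame(fit_raw))
# here "AUDITCEN" is a column in its own right
```

Note that poly() uses orthogonal polynomials by default, so the individual coefficients of the two fits differ even though the fitted values agree.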
The columns represent the respondents' answers (how serious they rate the problem) and the rows represent the age groups. The table was built as a matrix; the goal is to make a graph showing how the different age groups respond.
tabla<-matrix(c(0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 9, 0,
9, 1, 1, 5, 22, 0,
18, 1, 3, 1, 27, 1,
25, 7, 4, 6, 22, 3,
20, 2, 0, 0, 18, 1,
6, 2, 0, 2, 22, 0,
2, 0, 1, 1, 0, 4,
12, 0, 0, 5, 6, 0),ncol=6,byrow=TRUE)
colnames(tabla)<-c("No","is a problem","lite preblem","a moderate proble","Big problem","No respond")
rownames(tabla)<-c("16-24.5","24.5-33","33-41.5","41.5-50","50-58.5","58.5-67","67-75.5","75.5-84","No responde")
I think a heatmap is a good choice. Here is a solution using the tidyverse package.
library(tidyverse)
tabla2 <- tabla %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(Column, Value, -rowname)
ggplot(tabla2, aes(x = rowname, y = Column, fill = Value)) +
geom_tile() +
scale_fill_gradientn(name = "", colors = terrain.colors(10)) +
scale_x_discrete(name = "") +
scale_y_discrete(name = "")
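For reference, the reshaping step above (matrix to one row per cell) can also be sketched in base R. This is just an illustration on a 3 x 3 excerpt of the table from the question (three age groups, three answer columns); the object names tabla_small and long are made up here, and the columns are renamed to rowname, Column and Value so they match what the ggplot call above expects.

```r
# 3 x 3 excerpt of the question's table: three age groups, three answers
tabla_small <- matrix(c( 9, 1, 22,
                        18, 1, 27,
                        25, 7, 22), ncol = 3, byrow = TRUE)
colnames(tabla_small) <- c("No", "is a problem", "Big problem")
rownames(tabla_small) <- c("33-41.5", "41.5-50", "50-58.5")

# as.table() + as.data.frame() gives one row per (age group, answer) cell
long <- as.data.frame(as.table(tabla_small), responseName = "Value")
names(long)[1:2] <- c("rowname", "Column")

long
```

The result has the same long layout that gather() produces in the answer, so it can be passed to the same geom_tile() plot.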
I have the following simple data
data <- structure(list(status = c(9, 5, 9, 10, 11, 10, 8, 6, 6, 7, 10,
10, 7, 11, 11, 7, NA, 9, 11, 9, 10, 8, 9, 10, 7, 11, 9, 10, 9,
9, 8, 9, 11, 9, 11, 7, 8, 6, 11, 10, 9, 11, 11, 10, 11, 10, 9,
11, 7, 8, 8, 9, 4, 11, 11, 8, 7, 7, 11, 11, 11, 6, 7, 11, 6,
10, 10, 9, 10, 10, 8, 8, 10, 4, 8, 5, 8, 7), statusgruppe = c(0,
0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, NA, 0, 1, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,
1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0)), .Names = c("status",
"statusgruppe"), class = "data.frame", row.names = c(NA, -78L
))
From that I'd like to make a histogram:
ggplot(data, aes(status))+
geom_histogram(aes(y=..density..),
binwidth=1, colour = "black",
fill="white")+
theme_bw()+
scale_x_continuous("Staus", breaks=c(min(data$status,na.rm=T), median(data$status, na.rm=T), max(data$status, na.rm=T)),labels=c("Low", "Middle", "High"))+
scale_y_continuous("Percent", formatter="percent")
Now I'd like the bins to be coloured according to their value: e.g., bins with value > 9 should be dark grey and everything else light grey.
I have tried fill=statusgruppe, scale_fill_grey(breaks=9), etc., but I can't get it to work. Any ideas?
Hopefully this should get you started:
ggplot(data, aes(status, fill = ..x..))+
geom_histogram(binwidth = 1) +
scale_fill_gradient(low = "black", high = "white")
ggplot(data, aes(status, fill = ..x.. > 9))+
geom_histogram(binwidth = 1) +
scale_fill_grey()
How about using fill=..count.. or fill=I(..count..>9) right after y=..density..? You have to tinker with the legend title and labels a bit, but it gets the coloring right.
EDIT:
It seems I misunderstood your question a bit. If you want to define color based on the x-coordinate, you can use the ..x.. automatic variable similarly.
What about scale_manual? Here's a link to Hadley's site. I've used this function to set an appropriate fill colour for a boxplot. Not sure if it'll work with a histogram, though...
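Putting the two suggestions together (colour by the bin's x position, then pick the colours by hand), here is a minimal sketch. It assumes a recent ggplot2 (3.3.0 or later), where after_stat(x) replaces the older ..x.. notation, and it uses only the first 20 status values from the question to keep the example short. The grey shades and the legend title are arbitrary choices.

```r
library(ggplot2)

# First 20 status values from the question, with the NA dropped up front
status <- c(9, 5, 9, 10, 11, 10, 8, 6, 6, 7, 10, 10, 7, 11, 11, 7, NA, 9, 11, 9)
dat <- data.frame(status = status[!is.na(status)])

# after_stat(x) is the bin centre computed by the histogram stat;
# comparing it with 9 yields TRUE/FALSE per bin, which is mapped to two greys
p <- ggplot(dat, aes(status, fill = after_stat(x > 9))) +
  geom_histogram(binwidth = 1, colour = "black") +
  scale_fill_manual(values = c("FALSE" = "grey80", "TRUE" = "grey30"),
                    name = "Status > 9") +
  theme_bw()

p
```

With binwidth = 1 and integer data the bins are centred on the integers, so the bins for 10 and 11 come out dark and everything up to 9 stays light.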