Matching Individuals to a single group - r

I have to match students to their teachers using students' id and their teachers' id. One student can be paired with only one teacher. However, a student may have multiple entries with the same teacher. I want to retain only the cases where students are matched to a single unique teacher. Here is an example to the data frame.
CHID <- c(111,111,111,112,112,113,113,113,113,114), TEAID <- c(115,115,115,162,165,168,168,168,187,119), SCORE <- c(56,56,56,55,55,58,58,58,58,64)
From these data, I want to retain students with CHID's 111 and 114 since they are matched to one and only one unique teacher. Can some please help with an r-code for performing this task? Your assistance will be greatly appreciated.

Here's a solution using dplyr package -
df <- data.frame(
CHID = c(111,111,111,112,112,113,113,113,113,114),
TEAID = c(115,115,115,162,165,168,168,168,187,119),
SCORE = c(56,56,56,55,55,58,58,58,58,64)
)
group_by(df, CHID) %>% filter(n_distinct(TEAID) == 1) %>% ungroup()
# A tibble: 4 x 3
# CHID TEAID SCORE
# <dbl> <dbl> <dbl>
# 1 111 115 56.0
# 2 111 115 56.0
# 3 111 115 56.0
# 4 114 119 64.0
Here's a solution without any foreign packages -
df[ave(df$TEAID, df$CHID, FUN = function(x) length(unique(x))) == 1, ]
# CHID TEAID SCORE
# 1 111 115 56
# 2 111 115 56
# 3 111 115 56
# 10 114 119 64

Related

Problem with creating kinship matrix in R (kinship2) without sex

I need to create a kinship matrix. For this purpose, I wanted to use the kinship2 library in R, but the sex variable is required, which I don't have. From the documentation I read that you can use the value "3" for unknown gender, but it doesn't work. My code where I'm trying to get the effect but it comes out wrong 1x1 matrix.
My data:
nr.os nr.oj nr.ma ferma
1 169 152 84 3
2 170 152 84 3
3 171 152 84 3
4 172 152 84 3
5 173 152 84 3
6 174 152 84 3
My code:
library(kinship2)
my_data <- read.table("Zeszyt_s1.csv",header = TRUE, sep = ";")
my_data$sex <- as.integer(3)
df_fixed <- fixParents(id = my_data$nr.os, dadid=my_data$nr.oj, momid=my_data$nr.ma, sex=my_data$sex)
pedAll <- with(df_fixed,pedigree(
id = id,
dadid = dadid,
momid = momid,
sex = sex))
kinship(pedAll["1"])
Output:
1
1 0.5

Creating Boxplot in R

I have a table with data on the sales volumes of some products. I want to build several boxplots for each product. I.e. vertically I have sales volume and horizontally I have days. When building, I do not build boxplots in certain values. What is the reason for this?
Here is table:
Day Cottage cheese..pcs. Kefir..pcs. Sour cream..pcs.
1 1 99 103 111
2 2 86 101 114
3 3 92 100 116
4 4 87 112 120
5 5 86 104 111
6 6 88 105 122
7 7 88 106 118
Here is my code:
head(out1)# out1-the table above
boxplot(Day~Cottage cheese..pcs., data = out1)
Here is the result:
Try below:
# example data
out1 <- read.table(text = " Day Cottage.cheese Kefir Sour.cream
1 1 99 103 111
2 2 86 101 114
3 3 92 100 116
4 4 87 112 120
5 5 86 104 111
6 6 88 105 122
7 7 88 106 118", header = TRUE)
# reshape wide-to-long
outlong <- stats::reshape(out1, idvar = "Day", v.names = "value",
time = "product", times = colnames(out1)[2:4],
varying = colnames(out1)[2:4], direction = "long")
# then plot
boxplot(value~product, outlong)
In addition to the provided answer, if you desire to vertically have sales volume and horitontally have days (using the out1 data provided by zx8754).
library(tidyr)
library(data.table)
library(ggplot2)
#data from wide to long
dt <- pivot_longer(out1, cols = c("Kefir", "Sour.cream", "Cottage.cheese"), names_to = "Product", values_to = "Value")
#set dt to data.table object
setDT(dt)
#convert day from integer to a factor
dt[, Day := as.factor(Day)]
#ggplot
ggplot(dt, aes(x = Day, y = Value)) + geom_bar(stat = "identity") + facet_wrap(~Product)
facet_wrap provides separate graphs for the three products.
I created a bar chart here since boxplots would be useless in this case (every product has only one value each day)

Create a sequence of values by group between a min and max interval using dplyr

this is surely a basic question but couldn't find a way to solve.
I need to create a sequence of values for a minimum (dds_min) to maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <-as.numeric(c("47.2", "40"))
dds_max <-as.numeric(c("122", "105"))
dds_min.max <-as.data.frame(cbind(fs,dds_min, dds_max))
And this is what I did....
dss_levels <-dds_min.max %>%
group_by(fs) %>%
mutate(dds=seq(dds_min,dds_max,length.out=100))
I intended to create a new variable (dds), that has to be 100 length and start and end at different values depending on "fs". My expectation was to end with another dataframe (dss_levels) with two columns (fs and dds), 200 values on it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I make the sequence length 5 for illustrative purposes, you can change it to 100.
library(purrr)
library(tidyr)
dds_min.max %>%
mutate(dds= map2(dds_min, dds_max, seq, length.out = 5)) %>%
unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it. I.e :
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on #docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. There is likely some analogous function in the tidyr package.

Custom sorting of a dataframe in R

I have a binomail dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently soreted in a way to show the instances belonging to 0 group then the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean to show a row that belongs to the 0 group, the row after belonging to the 1 group then the zero group and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there's unbalanced numbers of 0's and 1's, it will leave them at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)

Resources