Complex If Else Statement in For Loop in R Warning Message - r

I made a for loop with many if else statements for my dataset and 2 empty vectors. However, I am getting a warning message saying:
In closenessSupport[i] <- rowMeans(seniorEdPlans[c("closenessFriends", ... :
number of items to replace is not a multiple of replacement length.
I just wondering on how to fix this vector length problem because I think it is messing with my intention to find the mean of 2 columns.. Any help appreciated.

Wow, way too many ith's for me. But a few nudges towards an answer. You definitely don't want a for loop down all the rows of your dataframe in this case. r is optimized to work on columns. I'm not totally sure I understand all your conditionals, but most likely dplyr::case_when will serve you well.
I grabbed your data and dputted just the first 20 rows. Then I wrote a mutate and case_when that produces a start towards closenessSupport. Is this sort of what you're out to do?
Revised after your additional input just the columns of interest
# https://stackoverflow.com/questions/61582653
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
seniored <- structure(list(id = 1:20,
age = c(17L, 16L, 17L, 16L, 17L, 18L,
17L, 17L, 18L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 16L, 17L, 16L,
18L),
higherEd = structure(c(1L, 5L, 1L, 1L, 3L, 1L, 2L, 2L,
5L, 5L, 3L, 4L, 3L, 2L, 5L, 3L, 4L, 5L, 1L, 1L), .Label = c("2-year",
"4-year", "None", "Other", "Vocational"), class = "factor"),
riskGroup = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 3L,
1L, 3L, 3L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 3L), .Label = c("High",
"Low", "Medium"), class = "factor"),
GPA = c(3.169, 2.703,
3.225, 2.488, 2.618, 2.928, 3.176, 3.256, 3.48, 3.244, 3.265,
3.4, 3.109, 3.513, 3.102, 2.656, 2.853, 3.046, 2.304, 3.473
),
closenessFriends = c(7L, 7L, 7L, 8L, NA, NA, NA, 6L, 7L,
NA, 5L, 6L, 3L, 1L, 1L, NA, 8L, 2L, NA, 8L),
closenessMentors = c(6L,
NA, 5L, NA, 5L, 4L, 8L, 6L, 4L, 5L, 4L, 4L, 4L, 5L, 5L, 5L,
3L, 4L, NA, 5L),
numSupportSources = c(2L, 1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 0L, 2L
)), row.names = c(NA, 20L), class = "data.frame")
seniored %>%
mutate(
closenessSupport = case_when(
numSupportSources == 1 & !is.na(closenessFriends) ~ as.numeric(closenessFriends),
numSupportSources == 1 & !is.na(closenessMentors) ~ as.numeric(closenessMentors),
numSupportSources == 2 ~ (closenessFriends + closenessMentors)/2,
numSupportSources == 0 ~ NA_real_),
supportType = case_when(
numSupportSources == 1 & !is.na(closenessFriends) ~ "FriendOnly",
numSupportSources == 1 & !is.na(closenessMentors) ~ "MentorOnly",
numSupportSources == 2 ~ "Both",
numSupportSources == 0 ~ "Neither"
)
) %>%
select(numSupportSources, closenessFriends, closenessMentors, closenessSupport, supportType)
#> numSupportSources closenessFriends closenessMentors closenessSupport
#> 1 2 7 6 6.5
#> 2 1 7 NA 7.0
#> 3 2 7 5 6.0
#> 4 1 8 NA 8.0
#> 5 1 NA 5 5.0
#> 6 1 NA 4 4.0
#> 7 1 NA 8 8.0
#> 8 2 6 6 6.0
#> 9 2 7 4 5.5
#> 10 1 NA 5 5.0
#> 11 2 5 4 4.5
#> 12 2 6 4 5.0
#> 13 2 3 4 3.5
#> 14 2 1 5 3.0
#> 15 2 1 5 3.0
#> 16 1 NA 5 5.0
#> 17 2 8 3 5.5
#> 18 2 2 4 3.0
#> 19 0 NA NA NA
#> 20 2 8 5 6.5
#> supportType
#> 1 Both
#> 2 FriendOnly
#> 3 Both
#> 4 FriendOnly
#> 5 MentorOnly
#> 6 MentorOnly
#> 7 MentorOnly
#> 8 Both
#> 9 Both
#> 10 MentorOnly
#> 11 Both
#> 12 Both
#> 13 Both
#> 14 Both
#> 15 Both
#> 16 MentorOnly
#> 17 Both
#> 18 Both
#> 19 Neither
#> 20 Both
Created on 2020-05-04 by the reprex package (v0.3.0)

Please accept one of the right answers
You had asked yesterday what was wrong with your loop. I looked today. The issue was running the rowwise inside the loop. It's already based on rows so running it inside a for loop that iterates through your rows is bound to cause problems.
I also made an example data set with representative values for your data. May not matter for your current data but a for loop will be much slower. In the 20,000 rows case a for loop took 1.4 seconds. The dplyr solution 11 milliseconds.
# build a reproducible dataset assume valid scores 1 - 8
# we'll make 9's equal to NA
set.seed(2020)
a <- sample(1:9, 20000, replace = TRUE)
a[a == 9] <- NA
set.seed(2021)
b <- sample(1:9, 20000, replace = TRUE)
b[b == 9] <- NA
seniorEdPlans2 <- data.frame(closenessFriends = a,
closenessMentors = b)
# use apply to calculate numSupportSources
seniorEdPlans2$numSupportSources <- apply(seniorEdPlans2,
1,
function(x) sum(!is.na(x))
)
# head(seniorEdPlans2, 50) # close enough
# this was the source of your error message it's already
# row based so can't put it in a for loop
seniorEdPlans2$closenessSupport <- rowMeans(seniorEdPlans2[c('closenessFriends', 'closenessMentors')],
na.rm = TRUE)
# your for loop
for (i in 1:nrow(seniorEdPlans2)) {
if (seniorEdPlans2$numSupportSources[i] == 2) {
seniorEdPlans2$supportType[i] <- "Both"
} else if (seniorEdPlans2$numSupportSources[i] == 0) {
seniorEdPlans2$supportType[i] <- "Neither"
} else if (!is.na(seniorEdPlans2$closenessFriends[i])) {
seniorEdPlans2$supportType[i] <- "FriendOnly"
} else {
seniorEdPlans2$supportType[i] <- "MentorOnly"
}
}
# head(seniorEdPlans2, 50)
Created on 2020-05-05 by the reprex package (v0.3.0)

Related

Is there any R function to make this happen?

Hi this is an excel form of data i want to be able to create in R
Just want to make it clear, I need to be able to make the column Group_fix equal to 5 for the next 12 months period observation, every time an observation date has 5 in its Group column.
How to make it possible in R? Can we use ifelse function?
Here is an approach with lag from dplyr.
library(dplyr)
data %>%
mutate(GroupFix = case_when(Group == 5 |
lag(Group,2) == 5 |
lag(Group,2) == 5 |
lag(Group,3) == 5 |
lag(Group,4) == 5 |
lag(Group,5) == 5 |
lag(Group,6) == 5 |
lag(Group,7) == 5 |
lag(Group,8) == 5 |
lag(Group,9) == 5 |
lag(Group,10) == 5 |
lag(Group,11) == 5 ~ 5,
TRUE ~ as.numeric(Group)))
Observation.Date Group GroupFix
1 12/31/19 1 1
2 1/31/20 2 2
3 2/29/20 2 2
4 3/31/20 2 2
5 4/30/20 3 3
6 5/31/20 4 4
7 6/30/20 5 5
8 7/31/20 5 5
9 8/31/20 4 5
10 9/30/20 3 5
11 10/31/20 2 5
12 11/30/20 3 5
13 12/31/20 4 5
14 1/31/21 5 5
15 2/28/21 5 5
16 3/31/21 4 5
17 4/30/21 3 5
18 5/31/21 2 5
19 6/30/21 1 5
20 7/31/21 1 5
21 8/31/21 1 5
22 9/30/21 1 5
23 10/31/21 1 5
24 11/30/21 1 5
25 12/31/21 1 5
26 1/31/22 1 5
27 2/28/22 1 1
Data
data <- structure(list(Observation.Date = structure(c(8L, 1L, 13L, 14L,
16L, 18L, 20L, 22L, 24L, 26L, 4L, 6L, 9L, 2L, 11L, 15L, 17L,
19L, 21L, 23L, 25L, 27L, 5L, 7L, 10L, 3L, 12L), .Label = c("1/31/20",
"1/31/21", "1/31/22", "10/31/20", "10/31/21", "11/30/20", "11/30/21",
"12/31/19", "12/31/20", "12/31/21", "2/28/21", "2/28/22", "2/29/20",
"3/31/20", "3/31/21", "4/30/20", "4/30/21", "5/31/20", "5/31/21",
"6/30/20", "6/30/21", "7/31/20", "7/31/21", "8/31/20", "8/31/21",
"9/30/20", "9/30/21"), class = "factor"), Group = c(1L, 2L, 2L,
2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-27L))

Checking that two different variables match

I have a dataset that I need to run a check of two variables "Call" and "cluster".
I need to check that calls of the same value all have the same cluster number.
The problem is the K-means clustering method I used assigns different cluster numbers each time I run it.
So I need a check that looks at the allele 1 call for example and checks that the numbers are all the same under the cluster call. (the numbers could be listed as 2 one run and the change to 3 if I re-run the code)
Please see data structure below
structure(list(RFU1 = c(-0.295502405, 0.964070798, 3381.332182,
3532.769062, 3431.836843, 3242.966511, 2104.791167, 2220.008503,
3548.252161, 3506.51418, 2290.273178, 2281.587684, -5.64819475,
-11.73109864, 3784.914039, 3619.00781, 3618.211608, 3248.106466,
3394.650325, 3339.870196, 2449.202902, 2426.835174, 3432.153478,
2761.392304, -9.267907504, -7.365704356, 3743.092314, 3787.241702,
2172.027787, 2096.845649, 2135.649551, 2149.145547, 2293.757257,
2348.099108, 2321.019045, 2022.168867, -17.93532331, -12.59832941,
3805.416768, 3498.998138, 2304.597239, 2509.63987, 2181.11547,
2261.011876, 3432.453036, 3662.758933, 2371.11049, 3068.827061,
2612.107589, 2687.824075, 3179.315918, 3688.525218, 3465.327523,
3405.154043, 2535.514915, 2452.200079, 374.435299, 423.6015308,
3742.515563, 3578.777925, 2634.955017, 2527.514043, 3817.579252,
3550.999412, -10.72035816, 3294.486334, 3352.40368, 3463.150507,
3472.576514, 3741.898759, 3571.369947, 3720.645869, 3739.569593,
3855.583168, 418.6837047, 49.47548241, 2171.034284, 2155.314713,
3432.363384, 3582.508917, 3425.415274, 3487.203299, 3505.23909,
3413.342096, 113.5100691, 128.6414053, 2454.588175, 2323.061591,
3188.705702, 3376.950546, 3291.072437, 3181.001961, 3195.013863,
3776.919197, 2284.22659, 2277.338631), RFU2 = c(-8.346468029,
235.4058561, 637.9218251, 650.3759507, 617.4161748, 604.0792911,
4270.310727, 4199.615749, 689.863543, 712.6144338, 4274.287194,
4541.168491, -1.626221758, -2.437395631, 802.0941252, 730.5998997,
686.9037384, 625.8245403, 644.3586836, 642.8833044, 4937.691887,
5159.479928, 725.4449756, 573.3910899, -4.006398006, 213.2859144,
739.7910786, 731.0150586, 4408.81923, 4767.533969, 4302.641493,
4325.913445, 4597.47663, 4666.904418, 4800.357526, 4142.535329,
-17.23239968, 178.5311942, 778.305843, 743.1438168, 4214.507094,
4553.703511, 4629.339014, 4459.697405, 661.7299014, 727.1054982,
4553.170272, 5482.231486, 4520.517999, 4737.802036, 643.3599887,
726.4314715, 696.5968338, 697.6099599, 411.8118071, 409.4943424,
5687.32635, 5757.51512, 766.4240193, 779.2403225, 4745.055632,
4582.267792, 749.5679421, 675.8747055, -7.254521898, 628.3467565,
631.116767, 672.7064514, 687.2642132, 718.1192917, 731.785499,
668.3686048, 784.8055727, 791.3155894, 4471.047168, 4501.597841,
4504.670332, 4442.621066, 682.0632225, 706.6204595, 680.5242182,
683.9558692, 684.2909706, 618.6535251, 5727.684954, 6098.485474,
5099.952926, 4779.742057, 571.4303822, 614.9258218, 602.9830491,
651.2847695, 591.8833499, 742.2387568, 4443.376841, 4716.792177
), cluster = c(2L, 2L, 4L, 4L, 4L, 4L, 1L, 1L, 4L, 4L, 1L, 1L,
2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 4L, 4L, 2L, 2L, 4L, 4L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 4L, 4L, 1L, 1L, 1L, 1L,
4L, 4L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 4L,
1L, 1L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 1L, 1L, 4L, 4L, 4L, 4L,
4L, 4L, 1L, 1L)), row.names = c(NA, -96L), class = "data.frame")
First run of k-means clustering
First run
Second run of k-means clustering
Second run
Since it is a question of labels, you can always relabel your groups afterwards. I used 4 clusters based on the within sum of square criteria :
suppressPackageStartupMessages( library(dplyr) )
dfr <- as_tibble(dfr)
groups <- lapply(1:10, function(ct) kmeans(x = dfr[,c("RFU1", "RFU2")], centers = ct)$tot.withinss) # as.matrix(
# plot(unlist(groups))
Method 1 : fix the random seed
# --- Experiment 1
set.seed(123)
groups <- kmeans(x = dfr[,c("RFU1", "RFU2")], centers = 4)
dfr$cluster <- groups$cluster
dfr
#> # A tibble: 96 x 4
#> id RFU1 RFU2 cluster
#> <chr> <dbl> <dbl> <int>
#> 1 1 -0.296 -8.35 4
#> 2 2 0.964 235. 4
#> 3 3 3381. 638. 2
#> 4 4 3533. 650. 2
#> 5 5 3432. 617. 2
#> 6 6 3243. 604. 3
#> 7 7 2105. 4270. 1
#> 8 8 2220. 4200. 1
#> 9 9 3548. 690. 2
#> 10 10 3507. 713. 2
#> # ... with 86 more rows
# --- Experiment 2
set.seed(123)
groups <- kmeans(x = dfr[,c("RFU1", "RFU2")], centers = 4, )
dfr$cluster2 <- groups$cluster
dfr
#> # A tibble: 96 x 5
#> id RFU1 RFU2 cluster cluster2
#> <chr> <dbl> <dbl> <int> <int>
#> 1 1 -0.296 -8.35 4 4
#> 2 2 0.964 235. 4 4
#> 3 3 3381. 638. 2 2
#> 4 4 3533. 650. 2 2
#> 5 5 3432. 617. 2 2
#> 6 6 3243. 604. 3 3
#> 7 7 2105. 4270. 1 1
#> 8 8 2220. 4200. 1 1
#> 9 9 3548. 690. 2 2
#> 10 10 3507. 713. 2 2
#> # ... with 86 more rows
Method 2 : labels don't matter, groups do so don't do anything
Method 3 : labels do matter
3.1 my data is ordinate, label per cluster centers
lab <- order(order(groups$centers[,1])) # c(2, 4, 3, 1)
# head( cbind(dfr$cluster2, lab[dfr$cluster2] ) )
dfr <- dfr %>%
mutate(label1 = lab[dfr$cluster2] )
3.2 my data has important individuals, label using references
# Suppose individuals 1, 4, 6 and 7 could name the classes
dfr <- dfr %>% tibble::rownames_to_column(var = "id") %>%
mutate(id = paste0("id_", id) )
refs <- tibble(id = dfr$id[c(1, 4, 6, 7)],
cluster = dfr$cluster2[c(1, 4, 6, 7)]
)
dfr %>%
mutate(label2 = refs$id[ c(2, 4, 3, 1)[cluster2] ]
# label2 = letters[1:4][cluster2]
)
#> # A tibble: 96 x 7
#> id RFU1 RFU2 cluster cluster2 label1 label2
#> <chr> <dbl> <dbl> <int> <int> <int> <chr>
#> 1 1 -0.296 -8.35 4 4 1 1
#> 2 2 0.964 235. 4 4 1 1
#> 3 3 3381. 638. 2 2 4 7
#> 4 4 3533. 650. 2 2 4 7
#> 5 5 3432. 617. 2 2 4 7
#> 6 6 3243. 604. 3 3 3 6
#> 7 7 2105. 4270. 1 1 2 4
#> 8 8 2220. 4200. 1 1 2 4
#> 9 9 3548. 690. 2 2 4 7
#> 10 10 3507. 713. 2 2 4 7
#> # ... with 86 more rows

Group by partial string matches

I have a table with a list of categories each with a count value that i'd like to collapse across based on similarity ... for example Mariner-1_Amel and Mariner-10 would be a single category of Mariner and anything with 'Jockey' or 'hAT' in the name should be collapsed across.
I'm struggling to find a solution that can cope with all the possibilities. Is there an easy dplyr solution?
reproducible with
> dput(tibs)
structure(list(type = c("(TTAAG)n_1", "AMARI_1", "Copia-4_LH-I",
"DNA", "DNA-1_CQ", "DNA/hAT-Charlie", "DNA/hAT-Tip100", "DNA/MULE-MuDR",
"DNA/P", "DNA/PiggyBac", "DNA/TcMar-Mariner", "DNA/TcMar-Tc1",
"DNA/TcMar-Tigger", "G3_DM", "Gypsy-10_CFl-I", "hAT-1_DAn", "hAT-16_SM",
"hAT-N4_RPr", "HELITRON7_CB", "Jockey-1_DAn", "Jockey-1_DEl",
"Jockey-12_DF", "Jockey-5_DTa", "Jockey-6_DYa", "Jockey-6_Hmel",
"Jockey-7_HMM", "Jockey-8_Hmel", "LINE/Dong-R4", "LINE/I", "LINE/I-Jockey",
"LINE/I-Nimb", "LINE/Jockey", "LINE/L1", "LINE/L2", "LINE/R1",
"LINE/R2", "LINE/R2-NeSL", "LINE/Tad1", "LTR/Gypsy", "Mariner_CA",
"Mariner-1_AMel", "Mariner-10_HSal", "Mariner-13_ACe", "Mariner-15_HSal",
"Mariner-16_DAn", "Mariner-19_RPr", "Mariner-30_SM", "Mariner-39_SM",
"Mariner-42_HSal", "Mariner-46_HSal", "Mariner-49_HSal", "TE-5_EL",
"Unknown", "Utopia-1_Crp"), n = c(1L, 1L, 1L, 2L, 1L, 18L, 3L,
9L, 2L, 8L, 21L, 12L, 18L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 7L, 2L, 7L, 24L, 1L, 1L, 5L, 3L, 1L,
1L, 7L, 1L, 5L, 1L, 1L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 2L, 1L, 190L,
1L)), row.names = c(NA, -54L), class = c("tbl_df", "tbl", "data.frame"
))
It seems to me that your broader types are mostly/entirely at the beginning of the string. You could therefore use just the first alphanumerical sequence ([[:alnum:]]+) of the type as broader types. This would give you the following types:
library(tidyverse)
df %>%
mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
count(type_short, sort = TRUE)
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 Mariner 12
#> 2 LINE 11
#> 3 DNA 10
#> 4 Jockey 8
#> 5 hAT 3
#> 6 AMARI 1
#> 7 Copia 1
#> 8 G3 1
#> 9 Gypsy 1
#> 10 HELITRON7 1
#> 11 LTR 1
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 1
#> 15 Utopia 1
You can easily use the new column to group_by:
df %>%
mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
group_by(type_short) %>%
summarise(n = sum(n))
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 AMARI 1
#> 2 Copia 1
#> 3 DNA 94
#> 4 G3 1
#> 5 Gypsy 3
#> 6 hAT 5
#> 7 HELITRON7 1
#> 8 Jockey 10
#> 9 LINE 54
#> 10 LTR 7
#> 11 Mariner 35
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 190
#> 15 Utopia 1
Theoretically, you could also try to use string similarity here. Yet your types do not have great similarity among themselves. A relative Levenshtein distance (distance / characters of the longer string) for example retrieves results like this:
strings <- c("Mariner-1_Amel", "Mariner-10")
adist(strings) / max(nchar(strings))
#> [,1] [,2]
#> [1,] 0.0000000 0.3571429
#> [2,] 0.3571429 0.0000000
This could be interpreted as the two types being 36% similar. Finding a good threshold might be hard in that case.
This solution uses package dplyr function case_when and base R grepl.
library(dplyr)
tibs %>%
mutate(category = case_when(
grepl("hAT|Jockey", type) ~ "Jokey",
grepl("Mariner", type) ~ "Mariner",
grepl("DNA", type) ~ "DNA",
grepl("LINE", type) ~"LINE",
TRUE ~ as.character(type)
),
category = factor(category)
)
If there is no commonality to define the groups you can define individual conditions using case_when.
library(dplyr)
library(stringr)
tibs %>%
mutate(category = case_when(str_detect(type, 'Mariner-\\d+') ~ 'Mariner',
str_detect(type, 'Jockey|hAT') ~ 'common',
#Add more conditions
))

R spread vs gather in tidyr

I have a dataframe in the following form:
person currentTest beforeValue afterValue
1 1 A 1.284297055 2.671763513
2 2 A -0.618359548 -2.354926905
3 3 A 0.039457430 -0.091709968
4 4 A -0.448608324 -0.362851832
5 5 A -0.961777124 -1.416284339
6 6 A 0.702471895 2.052181444
7 7 A -0.455222045 -2.125684279
8 8 A -1.231549132 -2.777425148
9 9 A -0.797234990 -0.558306183
10 10 A -0.709734963 -1.244159550
11 1 B -0.472799377 -0.869472343
12 2 B 0.059720737 1.444855389
13 3 B 0.924201532 2.731049485
14 4 B 0.658884183 1.017542475
15 5 B -1.989807256 -4.712671740
16 6 B 0.660241305 1.971232718
17 7 B 0.089636952 -0.564457911
18 8 B -0.828399941 0.507659171
19 9 B -0.838074237 -0.316996942
20 10 B -1.659197101 -3.317623686
...
What I'd like is to get a data frame of:
person A_Before A_After B_Before, B_After, ...
1 1.284297055 2.671763513 -0.472799377 -0.869472343
2 -0.618359548 -2.354926905 0.059720737 1.444855389
...
I've tried gather and spread but that's not quite what I need as there's the creation of new columns. Any suggestions?
The dput version for easy access is below:
resultsData <- dput(resultsData)
structure(list(person = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L), currentTest = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("A", "B", "C",
"D", "E", "F"), class = "factor"), beforeValue = c(1.28429705541541,
-0.618359548370402, 0.039457429902531, -0.448608324038257, -0.961777123997687,
0.702471895259405, -0.455222044740939, -1.23154913153736, -0.797234989892673,
-0.709734963076803, -0.47279937661921, 0.0597207367403981, 0.924201531911827,
0.658884182599422, -1.98980725637449, 0.660241304554785, 0.0896369516528346,
-0.828399941497236, -0.838074236572976, -1.65919710134782, 0.577469369909437,
1.92748171699512, -0.245593641496638, 0.126104785456265, -0.559338325961641,
1.29802115505785, 0.719406692531958, 0.969414499181256, -0.814697072724845,
0.86465983690719, -0.709539159817187, 1.02775240926492, -0.50490096148732,
0.40769259465753, -0.868531009656408, 0.949518511358715, 2.32458579520932,
-0.257578702370506, -0.789761851618986, 0.0979274657020477, -0.00803566278013502,
1.42984177159549, 1.45485678109231, -0.956556613290905, 0.443323691839299,
-0.261951072972966, -1.30990441429799, 0.0921741874883992, -1.02612779569131,
0.81550719514697, -0.403037731404182, -0.384422139459082, 0.417074857491798,
-1.37128032791855, -0.0796160137501127, 1.35302483988882, -0.752751140138746,
0.812453275384099, -1.32443072805549, -1.66986584340583), afterValue = c(2.67176351335094,
-2.35492690509713, -0.0917099675669388, -0.362851831626841, -1.4162843393352,
2.05218144382074, -2.12568427901904, -2.77742514848958, -0.558306182843248,
-1.24415954975022, -0.869472343362331, 1.44485538931333, 2.73104948477609,
1.01754247530805, -4.71267174035743, 1.9712327179732, -0.564457911016569,
0.507659170771878, -0.31699694238194, -3.31762368638082, 1.09068172988414,
4.37537723545199, -0.116850493406969, 1.9533832597394, -1.69003563933244,
2.62250581307257, -0.00837379068728961, 1.84192937988371, -0.675899868505659,
2.08506660046288, -0.583526785879512, 0.699298693972492, -1.26172199141024,
1.23589313451783, -1.56008919968504, 0.436686458587792, 0.11699090169902,
-1.07206510594109, 1.21204947218164, -0.812406581646911, 0.50373332256566,
-0.084945367568491, -0.236015748624917, -0.479606239480476, -0.596799139055039,
-0.562575023441403, -0.339935276865152, -0.213813544612318, -0.265296303857373,
-1.12545083569158, 0.0105156062602101, 0.635695183644557, 0.767433440961415,
0.16648012185356, 0.544633089427927, -0.904001384160196, -0.429299134808951,
0.764224744168297, -0.166062348771635, -0.101892580202475)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("person",
"currentTest", "beforeValue", "afterValue"))
We can use dcast from reshape2
library(reshape2)
meltdf <- melt(resultsData, id.vars=1:2)
dcast(meltdf, person ~ currentTest + variable)
> dcast(meltdf, person ~ currentTest + variable)
person A_beforeValue A_afterValue B_beforeValue B_afterValue C_beforeValue C_afterValue D_beforeValue D_afterValue E_beforeValue
1 1 1.28429706 2.67176351 -0.47279938 -0.8694723 0.5774694 1.090681730 -0.70953916 -0.5835268 -0.008035663
2 2 -0.61835955 -2.35492691 0.05972074 1.4448554 1.9274817 4.375377235 1.02775241 0.6992987 1.429841772
3 3 0.03945743 -0.09170997 0.92420153 2.7310495 -0.2455936 -0.116850493 -0.50490096 -1.2617220 1.454856781
4 4 -0.44860832 -0.36285183 0.65888418 1.0175425 0.1261048 1.953383260 0.40769259 1.2358931 -0.956556613
5 5 -0.96177712 -1.41628434 -1.98980726 -4.7126717 -0.5593383 -1.690035639 -0.86853101 -1.5600892 0.443323692
6 6 0.70247190 2.05218144 0.66024130 1.9712327 1.2980212 2.622505813 0.94951851 0.4366865 -0.261951073
7 7 -0.45522204 -2.12568428 0.08963695 -0.5644579 0.7194067 -0.008373791 2.32458580 0.1169909 -1.309904414
8 8 -1.23154913 -2.77742515 -0.82839994 0.5076592 0.9694145 1.841929380 -0.25757870 -1.0720651 0.092174187
9 9 -0.79723499 -0.55830618 -0.83807424 -0.3169969 -0.8146971 -0.675899869 -0.78976185 1.2120495 -1.026127796
10 10 -0.70973496 -1.24415955 -1.65919710 -3.3176237 0.8646598 2.085066600 0.09792747 -0.8124066 0.815507195
E_afterValue F_beforeValue F_afterValue
1 0.50373332 -0.40303773 0.01051561
2 -0.08494537 -0.38442214 0.63569518
3 -0.23601575 0.41707486 0.76743344
4 -0.47960624 -1.37128033 0.16648012
5 -0.59679914 -0.07961601 0.54463309
6 -0.56257502 1.35302484 -0.90400138
7 -0.33993528 -0.75275114 -0.42929913
8 -0.21381354 0.81245328 0.76422474
9 -0.26529630 -1.32443073 -0.16606235
10 -1.12545084 -1.66986584 -0.10189258
You can use a combined gather + spread approach; Gather the *Values columns and combine with currentTest to form the new header, then spread to wide format:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value)
# A tibble: 10 x 13
# person A_afterValue A_beforeValue B_afterValue B_beforeValue C_afterValue C_beforeValue
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.67176351 1.28429706 -0.8694723 -0.47279938 1.090681730 0.5774694
# 2 2 -2.35492691 -0.61835955 1.4448554 0.05972074 4.375377235 1.9274817
# 3 3 -0.09170997 0.03945743 2.7310495 0.92420153 -0.116850493 -0.2455936
# 4 4 -0.36285183 -0.44860832 1.0175425 0.65888418 1.953383260 0.1261048
# 5 5 -1.41628434 -0.96177712 -4.7126717 -1.98980726 -1.690035639 -0.5593383
# 6 6 2.05218144 0.70247190 1.9712327 0.66024130 2.622505813 1.2980212
# 7 7 -2.12568428 -0.45522204 -0.5644579 0.08963695 -0.008373791 0.7194067
# 8 8 -2.77742515 -1.23154913 0.5076592 -0.82839994 1.841929380 0.9694145
# 9 9 -0.55830618 -0.79723499 -0.3169969 -0.83807424 -0.675899869 -0.8146971
#10 10 -1.24415955 -0.70973496 -3.3176237 -1.65919710 2.085066600 0.8646598
# ... with 6 more variables: D_afterValue <dbl>, D_beforeValue <dbl>, E_afterValue <dbl>,
# E_beforeValue <dbl>, F_afterValue <dbl>, F_beforeValue <dbl>
If you need to rename the columns:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value) %>%
rename_at(vars(matches("Value$")), funs(gsub("Value$", "", .)))
We could do this in a single line using recast
reshape2::recast(resultsData, person ~currentTest + variable, id.var = 1:2)
#person A_beforeValue A_afterValue B_beforeValue B_afterValue C_beforeValue C_afterValue D_beforeValue D_afterValue
#1 1 1.28429706 2.67176351 -0.47279938 -0.8694723 0.5774694 1.090681730 -0.70953916 -0.5835268
#2 2 -0.61835955 -2.35492691 0.05972074 1.4448554 1.9274817 4.375377235 1.02775241 0.6992987
#3 3 0.03945743 -0.09170997 0.92420153 2.7310495 -0.2455936 -0.116850493 -0.50490096 -1.2617220
#4 4 -0.44860832 -0.36285183 0.65888418 1.0175425 0.1261048 1.953383260 0.40769259 1.2358931
#5 5 -0.96177712 -1.41628434 -1.98980726 -4.7126717 -0.5593383 -1.690035639 -0.86853101 -1.5600892
#6 6 0.70247190 2.05218144 0.66024130 1.9712327 1.2980212 2.622505813 0.94951851 0.4366865
#7 7 -0.45522204 -2.12568428 0.08963695 -0.5644579 0.7194067 -0.008373791 2.32458580 0.1169909
#8 8 -1.23154913 -2.77742515 -0.82839994 0.5076592 0.9694145 1.841929380 -0.25757870 -1.0720651
#9 9 -0.79723499 -0.55830618 -0.83807424 -0.3169969 -0.8146971 -0.675899869 -0.78976185 1.2120495
#10 10 -0.70973496 -1.24415955 -1.65919710 -3.3176237 0.8646598 2.085066600 0.09792747 -0.8124066
# E_beforeValue E_afterValue F_beforeValue F_afterValue
#1 -0.008035663 0.50373332 -0.40303773 0.01051561
#2 1.429841772 -0.08494537 -0.38442214 0.63569518
#3 1.454856781 -0.23601575 0.41707486 0.76743344
#4 -0.956556613 -0.47960624 -1.37128033 0.16648012
#5 0.443323692 -0.59679914 -0.07961601 0.54463309
#6 -0.261951073 -0.56257502 1.35302484 -0.90400138
#7 -1.309904414 -0.33993528 -0.75275114 -0.42929913
#8 0.092174187 -0.21381354 0.81245328 0.76422474
#9 -1.026127796 -0.26529630 -1.32443073 -0.16606235
#10 0.815507195 -1.12545084 -1.66986584 -0.10189258

Checking row format of csv

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.
Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueNamedoes not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repxampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"). You could edit the repexampleEdit line based on your needs.
This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.

Resources