I've been given some data that I've combined into long form, but I need to get it into a certain format for a deliverable. I've tinkered with dataframe and list options and cannot seem to find a way to get the data I have into the output form I need. Any thoughts and solutions are appreciated.
If the desired output form seems odd for R, it is because other people will open the resulting data in Excel for additional study. So I will save the final data as a csv or Excel file. The full data in the desired form will have 40 rows (+header) and 110 columns (55 student and score pairs).
Here is an example of my long form data:
class  student  score
1      a        0.4977
1      b        0.7176
1      c        0.9919
1      d        0.3800
1      e        0.7774
2      f        0.9347
2      g        0.2121
2      h        0.6517
2      i        0.1256
2      j        0.2672
3      k        0.3861
3      l        0.0134
3      m        0.3824
3      n        0.8697
3      o        0.3403
Here is an example of how I need the final data to appear:
class_1_student  class_1_score  class_2_student  class_2_score  class_3_student  class_3_score
a                0.4977         f                0.9347         k                0.3861
b                0.7176         g                0.2121         l                0.0134
c                0.9919         h                0.6517         m                0.3824
d                0.3800         i                0.1256         n                0.8697
e                0.7774         j                0.2672         o                0.3403
Here is R code to generate the sample long form and desired form data:
set.seed(1)
d <- data.frame(
  class = c(rep(1, 5), rep(2, 5), rep(3, 5)),
  student = c(letters[1:5], letters[6:10], letters[11:15]),
  score = round(runif(15, 0, 1), 4)
)
d2 <- data.frame(
  class_1_student = d[1:5, 2],
  class_1_score = d[1:5, 3],
  class_2_student = d[6:10, 2],
  class_2_score = d[6:10, 3],
  class_3_student = d[11:15, 2],
  class_3_score = d[11:15, 3]
)
If it's helpful, I also have the student and score data in separate matrices (1 row per student and 1 column per class) that I could use to help generate the final data.
You can just split the data:
library(tidyverse)
split(select(d, -class), d$class) %>%
  imap(~ setNames(.x, str_c("class", .y, names(.x), sep = "_"))) %>%
  bind_cols()
Column binding will work only if the groups are of equal sizes.
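An alternative that avoids the equal-size restriction is tidyr's pivot_wider; shorter groups simply get NA in the unused cells. A minimal sketch, assuming tidyr >= 1.2 (for the names_vary argument):

library(dplyr)
library(tidyr)

d %>%
  group_by(class) %>%
  mutate(row = row_number()) %>%   # within-class position, used to align rows across classes
  ungroup() %>%
  pivot_wider(
    id_cols     = row,
    names_from  = class,
    values_from = c(student, score),
    names_glue  = "class_{class}_{.value}",
    names_vary  = "slowest"        # keeps each class's student/score columns adjacent
  ) %>%
  select(-row)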
I have two data frames. The first data frame contains information about material numbers and has multiple values per material number. For example:
df1 =
materialNumber value
A 10
A 20
A 30
A 40
B 1
B 2
B 43
C 12
C 19
and then another dataframe that only contains a single value for the same material number seen in df1.
df2=
Materialnumber Value
A 300
B 13
C 18
I am trying to determine if the values in data frame 2 are outliers compared to what is in data frame 1. I wrote a function to do this. However, I have over 10,000 material numbers.
What is the best way to group the material numbers and run them through the function?
As discussed in chat, here follows the adaptation of your code to df1 and df2, without further debugging:
dixon_test_results <- function(materialNumber, forecast) {
  target <- materialNumber  # keep the argument value so it is not masked by the column of the same name
  EKPO_Values <- df1 %>%
    dplyr::filter(materialNumber == target) %>%
    dplyr::pull(value)
  Q <- abs(forecast - EKPO_Values[which.min(abs(EKPO_Values - forecast))]) / diff(range(EKPO_Values))
  print(Q)
  # assumes 95% confidence
  # reference: webspace.ship.edu/pgmarr/…
  dixon_q_table_val <- switch(
    length(EKPO_Values) - 1, # assumes the forecast is now part of the EKPO data set; table values start at n = 3
    0.9411,
    0.7651,
    0.6423,
    0.5624,
    0.5077,
    0.4673,
    0.4363,
    0.4122,
    0.3922,
    0.3755,
    0.3615,
    0.3496,
    0.3389,
    0.3293,
    0.3208,
    0.3135,
    0.3068,
    0.3005,
    0.2947,
    0.2895,
    0.2851,
    0.2804,
    0.2763,
    0.2725,
    0.2686,
    0.2655,
    0.2622,
    0.2594
  )
  if (Q > dixon_q_table_val) {
    return(materialNumber)
  } else {
    return(NA_character_)
  }
}
df2 %>%
  dplyr::mutate(res = dixon_test_results(Materialnumber, Value))
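One caveat, added here as a hedged note rather than part of the original answer: dixon_test_results() expects a single material number and a single forecast, so the plain mutate() call above passes whole columns to it. Wrapping the call in rowwise() makes dplyr apply the function one row at a time:

library(dplyr)

df2 %>%
  rowwise() %>%
  mutate(res = dixon_test_results(Materialnumber, Value)) %>%
  ungroup()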
I am trying, or rather I wish I could try, to write a loop in R that executes the Wilcoxon test (wilcox.test) in an iterative way, comparing 2 groups of values in each row of a data.frame, and returning for each row the p-value that is then put in a dataframe with its associated row label.
The data.frame is as follows:
> tab[1:5,]
mol E12 E15 E22 E25 E26 E27 E38 E44 E47
1 A 7362.40 2475.93 3886.06 5825.59 6882.00 3250.05 3406.65 6416.29 7786.73
2 B 5391.42 2037.88 3330.05 4043.83 5766.20 2591.69 3603.95 14431.89 8320.70
3 C 1195.89 241.24 252.46 865.97 1970.28 899.22 346.36 1135.86 1179.31
4 D 502.64 171.41 434.29 508.22 419.34 260.13 298.14 326.70 167.07
5 E 181.63 171.41 165.30 150.47 164.09 109.19 122.76 212.74 155.60
Column labels are: mol, the specific molecule evaluated (about 20); E12 to E47 the samples for which the value of each molecule is measured.
Groups to be compared are:
P: samples E12, E25, E26, E27, E44; D: samples E15, E22, E38, E47.
The output should look like this:
mol p-value
A 1
B 0.5556
C 0.9048
etc.
I tried to use a for loop, but I am absolutely not able to manage it in this (for me complicated) context.
Any help with comments on the meaning of the instructions for a newbie like me is much appreciated.
apply() works like a loop over matrices and arrays. In this case, with MARGIN = 1 it loops along the rows. Each row, temporarily converted into a vector x, is passed to function(x) wilcox.test(x[P], x[D])$p.value, the result being one p-value per row. P and D are logical vectors specifying which elements within x should be used in each sample.
tab0 <- read.table(text="mol E12 E15 E22 E25 E26 E27 E38 E44 E47
A 7362.40 2475.93 3886.06 5825.59 6882.00 3250.05 3406.65 6416.29 7786.73
B 5391.42 2037.88 3330.05 4043.83 5766.20 2591.69 3603.95 14431.89 8320.70
C 1195.89 241.24 252.46 865.97 1970.28 899.22 346.36 1135.86 1179.31
D 502.64 171.41 434.29 508.22 419.34 260.13 298.14 326.70 167.07
E 181.63 171.41 165.30 150.47 164.09 109.19 122.76 212.74 155.60",
header=TRUE)
tab <- as.matrix(tab0[,-1])
P <- colnames(tab) %in% c("E12", "E25", "E26", "E27", "E44")
D <- colnames(tab) %in% c("E15", "E22", "E38", "E47")
pv <- apply(tab, 1, function(x) wilcox.test(x[P], x[D])$p.value)
data.frame(tab0[1], p.val=signif(pv, 4))
# mol p.val
# 1 A 0.5556
# 2 B 0.4127
# 3 C 0.1111
# 4 D 0.1905
# 5 E 0.9048
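For completeness, here is a sketch of the same per-molecule test written with dplyr/tidyr instead of apply(); it assumes tidyr >= 1.0 (for pivot_longer) and reuses the tab0 data frame from above:

library(dplyr)
library(tidyr)

p_samples <- c("E12", "E25", "E26", "E27", "E44")

tab0 %>%
  pivot_longer(-mol, names_to = "sample", values_to = "value") %>%
  mutate(group = if_else(sample %in% p_samples, "P", "D")) %>%
  group_by(mol) %>%
  summarise(p.val = signif(wilcox.test(value[group == "P"],
                                       value[group == "D"])$p.value, 4))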
I can figure out a solution to my problem, but only in a very suboptimal way, so the solution I have does not scale to a large df. Let me explain.
I have a big dataframe and I need to create new columns by subtracting two other ones. Let me show you using a simple df.
A<-rnorm(10)
B<-rnorm(10)
C<-rnorm(10)
D<-rnorm(10)
E<-rnorm(10)
F<-rnorm(10)
df1<-data_frame(A,B,C,D,E,F)
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract pairs of columns that are the same distance apart (D-A, E-B, F-C) while (ii) giving each new column a name based on the initial variables' names.
I did it this way and it works:
df2 <- df1 %>%
  transmute(!!paste0("diff", "D", "A") := D - A,
            !!paste0("diff", "E", "B") := E - B,
            !!paste0("diff", "F", "C") := F - C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (like mutate_all, mutate_at or add_column) but nothing works...
OK, here's a method that will work for the full width of your data set.
library(dplyr)  # for tibble(), bind_cols(), select()

df1 <- tibble(A = rnorm(10),
              B = rnorm(10),
              C = rnorm(10),
              D = rnorm(10),
              E = rnorm(10),
              F = rnorm(10),
              G = rnorm(10),
              H = rnorm(10),
              I = rnorm(10))

ct <- 1:(ncol(df1) - 3)  # only iterate while a column three places ahead exists

diff_tbl <- tibble(testcol = rnorm(10))

for (i in ct) {
  new_tbl <- tibble(col = df1[[i + 3]] - df1[[i]])
  names(new_tbl)[1] <- paste('diff', colnames(df1[i + 3]), colnames(df1[i]), sep = '')
  diff_tbl <- bind_cols(diff_tbl, new_tbl)
}

diff_tbl <- diff_tbl %>%
  select(-testcol)

df1 <- bind_cols(df1, diff_tbl)
Basically, what you are doing is creating a second dummy tibble to compute the differences, iterating over the possible differences (i.e. gaps of three columns) then assembling them into a single tibble, then binding those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create the diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
You could divide the data frame into two parts and do
inds <- ncol(df1)/2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
df1[(inds + 1):ncol(df1)] - df1[1:inds]
Note that column names with dashes in them are non-syntactic in R and not recommended.
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072
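If you want the pairing itself computed rather than typed out, here is a sketch with purrr; like the answers above, it assumes the first half of the columns is subtracted from the second half:

library(dplyr)
library(purrr)

half <- ncol(df1) / 2
lhs  <- names(df1)[(half + 1):ncol(df1)]   # D, E, F, ...
rhs  <- names(df1)[1:half]                 # A, B, C, ...

diffs <- map2_dfc(lhs, rhs,
                  ~ tibble(!!paste0("diff", .x, .y) := df1[[.x]] - df1[[.y]]))
bind_cols(df1, diffs)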
I have a table that looks like this:
LDAutGroup  PatientDays  ExposedDays  sex  Ageband      DrugGroup  Prop   LowerCI  UpperCI  concat
Group1      100          23           M    5 to 10      PSY        23     15.84    32.15    23 (15.84 -32.15)
Group2      500          56           F    11 to 17     HYP        11.2   8.73     14.27    11.2 (8.73 -14.27)
Group3      300          89           M    18 and over  PSY        29.67  24.78    35.07    29.67 (24.78 -35.07)
Group1      200          34           F    5 to 10      PSY        17     12.43    22.82    17 (12.43 -22.82)
Group2      456          78           M    11 to 17     ANX        17.11  13.93    20.83    17.11 (13.93 -20.83)
Following this, I want a pivot table that lays out the concat column as the valueName. However, pivottabler only works on integer or numeric values. The following code runs fine with any one of the Prop, LowerCI or UpperCI columns on its own, but gives an error message for the concat column:
library(readr)
library(dplyr)
library(epitools)
library(gtools)
library(reshape2)
library(binom)
library(pivottabler)
pt <- PivotTable$new()
pt$addData(a)
pt$addColumnDataGroups("LDAutGroup")
pt$addColumnDataGroups("sex")
pt$addRowDataGroups("DrugGroup")
pt$addRowDataGroups("Ageband")
pt$defineCalculation(calculationName="TotalTrains", type="value", valueName="Prop")
pt$renderPivot()
Is there a way I can make this work on the concat column? I want a table that has the following layout, with the cells populated with the strings from the concat column of the table above:
Group1 Group2 Group3
M F M F M F
ANX 11 to 17
18 and over
Total
HYP 11 to 17
18 and over
5 to 10
Total
PSY 18 and over
5 to 10
Total
I am the pivottabler package author.
As you say, pivottabler currently only pivots integer/numeric columns. A workaround exists, however, using a custom cell calculation function to calculate the value in each cell. Custom calculation functions were intended for more complex use cases, so using them in this way is a sledgehammer approach, but it does the job, and I suppose it makes sense in some scenarios, e.g. if you have other numerical pivot tables and want a uniform appearance for the pivot tables in your output.
Adapting an example from the package vignettes:
library(pivottabler)
library(dplyr)
trainsConcatendated <- mutate(bhmtrains, ConcatValue = paste(TOC, TrainCategory, sep=" "))
getConcatenatedValue <- function(pivotCalculator, netFilters, format, baseValues, cell) {
  # get the data frame
  trains <- pivotCalculator$getDataFrame("trainsConcatendated")
  # apply the filters coming from the headers in the pivot table
  filteredTrains <- pivotCalculator$getFilteredDataFrame(trains, netFilters)
  # get the distinct values
  distinctValues <- distinct(filteredTrains, ConcatValue)
  # get the value of the concatenated column
  # this just returns the first concatenated value for the cell
  # if there are multiple values, the others are ignored
  if (length(distinctValues$ConcatValue) == 0) {
    tv <- ""
  } else {
    tv <- distinctValues$ConcatValue[1]
  }
  # build the return value
  # the raw value must be numerical, so simply set this to zero
  value <- list()
  value$rawValue <- 0
  value$formattedValue <- tv
  return(value)
}
pt <- PivotTable$new()
pt$addData(trainsConcatendated)
pt$addColumnDataGroups("TrainCategory", addTotal=FALSE)
pt$addRowDataGroups("TOC", addTotal=FALSE)
pt$defineCalculation(calculationName="ConcatValue",
type="function", calculationFunction=getConcatenatedValue)
pt$renderPivot()
Results: the rendered pivot table shows one concatenated value in each TOC / TrainCategory cell.
Applying the same custom-function approach to the CI columns (lower or upper) is questionable, since subtotals only really make sense for statistics such as means, and they make no sense at all for the concat column (at least not in a simple pivot table).
If you do not need subtotals, you can easily use the tidyr library and spread a character variable into a wide table. Here are two lines of code: the first creates the groups for the columns and the second changes the table to the spread (wide) format.
library(tidyr)
Table_Original <- unite(Table_Original, "Col_pivot", c("LDAutGroup", "sex"), sep = "_", remove = F)
Table_Pivot <- spread(Table_Original[ ,c("Col_pivot","DrugGroup", "Ageband", "concat")], Col_pivot, concat)
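In newer tidyr (>= 1.0), spread() has been superseded by pivot_wider(); an equivalent sketch for the second line, assuming the same Table_Original columns:

library(tidyr)

Table_Pivot <- pivot_wider(
  Table_Original[, c("Col_pivot", "DrugGroup", "Ageband", "concat")],
  names_from  = Col_pivot,
  values_from = concat
)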
This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in `$<-.data.frame`(`*tmp*`, "mean", value = list(su = 1:83, RT = c(378.1328125,  :
  replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in base R, ave() is intended for exactly this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
Simply merge your aggregated means data with the full data frame, joined by the subject:
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by = "su")
Another compelling way of handling this without generating extra data objects is to use group_by from the dplyr package:
library(dplyr)

# Generating some data
data <- data.table::data.table(
  su = sample(letters[1:5], size = 14, replace = TRUE),
  RT = rnorm(14))[order(su)]

# Performing
> data %>% group_by(su) %>%
+   mutate(Mean = mean(RT)) %>%
+   ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978