R - How to split/combine columns for multiple variables [duplicate] - r

This question already has answers here:
R - reshaping 2 column data frame to multiple column matrix [duplicate]
(4 answers)
Reshaping 2 column data.table from long to wide
(2 answers)
Closed 5 years ago.
I'm quite the novice at R and I haven't been able to find an answer on how to split a column containing multiple variables (Sample 1-4) into separate columns while keeping the data each value belongs to. Here's an example:
Samples Content
Sample 1 70.7
Sample 1 91.6
Sample 1 92.6
Sample 1 65.2
Sample 1 80.0
Sample 1 82.1
Sample 1 88.1
Sample 1 92.2
Sample 1 53.3
Sample 1 80.0
Sample 1 60.3
Sample 1 89.7
Sample 1 84.8
Sample 1 94.0
Sample 1 71.8
Sample 1 76.9
Sample 1 91.4
Sample 1 57.9
Sample 1 61.9
Sample 1 71.5
Sample 2 88.7
Sample 2 67.6
Sample 2 61.7
Sample 2 70.8
Sample 2 45.3
Sample 2 55.6
Sample 2 64.6
Sample 2 62.7
Sample 2 72.4
Sample 2 46.8
Sample 2 59.0
Sample 2 63.7
Sample 2 67.0
Sample 2 71.6
Sample 2 48.3
Sample 2 55.6
Sample 2 62.5
Sample 2 60.0
Sample 2 72.9
Sample 2 47.4
Sample 3 42.3
Sample 3 48.2
Sample 3 64.0
Sample 3 33.3
Sample 3 19.0
Sample 3 41.0
Sample 3 53.1
Sample 3 46.5
Sample 3 30.0
Sample 3 43.4
Sample 3 43.7
Sample 3 92.0
Sample 3 53.0
Sample 3 33.0
Sample 3 48.4
Sample 3 43.2
Sample 3 41.8
Sample 3 62.5
Sample 3 33.3
Sample 3 49.3
Sample 4 51.8
Sample 4 57.3
Sample 4 43.3
Sample 4 42.3
Sample 4 37.6
Sample 4 54.9
Sample 4 71.1
Sample 4 33.8
Sample 4 43.1
Sample 4 39.1
Sample 4 63.0
Sample 4 74.0
Sample 4 31.0
Sample 4 48.3
Sample 4 42.9
Sample 4 62.2
Sample 4 35.4
Sample 4 33.8
Sample 4 40.7
Sample 4 41.2
I tried tidyr with no success. I want the output to be something like this:
Sample 1 Sample 2 Sample 3 Sample 4
70.7 88.7 42.3 51.8
91.6 67.6 48.2 57.3
92.6 61.7 64.0 43.3
65.2 70.8 33.3 42.3
80.0 45.3 19.0 37.6
82.1 55.6 41.0 54.9
88.1 64.6 53.1 71.1
92.2 62.7 46.5 33.8
53.3 72.4 30.0 43.1
80.0 46.8 43.4 39.1
60.3 59.0 43.7 63.0
89.7 63.7 92.0 74.0
84.8 67.0 53.0 31.0
94.0 71.6 33.0 48.3
71.8 48.3 48.4 42.9
76.9 55.6 43.2 62.2
91.4 62.5 41.8 35.4
57.9 60.0 62.5 33.8
61.9 72.9 33.3 40.7
71.5 47.4 49.3 41.2
Many thanks. If a solution is identified, is there also an answer for doing the reverse?
Extra - Is there any way to perform a t-test on data which is stacked in one column, as in the first example, without having to transform it?

You may be hitting the "duplicate identifiers" issue with tidyr::spread. You first need to generate unique combinations of Sample + identifier, which you can do like this (assuming a data frame named df1):
library(tidyverse) # for dplyr + tidyr
df1 %>%
  group_by(Samples) %>%
  mutate(id = row_number()) %>%
  spread(Samples, Content) %>%
  select(-id)
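In tidyr 1.0 and later, spread is superseded by pivot_wider; an equivalent version of the same pipeline would be something like this (shown on a small cut of the example data):

```r
library(dplyr)
library(tidyr)

# small version of the example data
df1 <- data.frame(Samples = rep(c("Sample 1", "Sample 2"), each = 3),
                  Content = c(70.7, 91.6, 92.6, 88.7, 67.6, 61.7))

wide <- df1 %>%
  group_by(Samples) %>%
  mutate(id = row_number()) %>%   # unique id within each sample
  ungroup() %>%
  pivot_wider(names_from = Samples, values_from = Content) %>%
  select(-id)
wide
```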
"if I wanted to do the reciprocate"
Do you mean going the other way, from the wide form back to the original long form? Then use gather. Add this to the end of the code above and see what happens:
%>% gather(Samples, Content)
t-test: there are lots of ways you could run a t-test on the long-format data. For example, a base R way to compare Samples 1 and 2 might be:
t.test(df1[df1$Samples == "Sample 1", "Content"],
       df1[df1$Samples == "Sample 2", "Content"])
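The same comparison can also be written with t.test's formula interface; the droplevels call matters when Samples is a factor, because the formula interface requires exactly two groups (data values below are taken from the question):

```r
df1 <- data.frame(Samples = factor(rep(c("Sample 1", "Sample 2", "Sample 3"),
                                       each = 5)),
                  Content = c(70.7, 91.6, 92.6, 65.2, 80.0,
                              88.7, 67.6, 61.7, 70.8, 45.3,
                              42.3, 48.2, 64.0, 33.3, 19.0))

# keep only the two groups being compared, dropping unused factor levels
two <- droplevels(subset(df1, Samples %in% c("Sample 1", "Sample 2")))
tt <- t.test(Content ~ Samples, data = two)
tt
```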

As the number of elements for each 'Sample' is the same, we can use unstack from base R
unstack(df1, Content~Samples)
# Sample.1 Sample.2 Sample.3 Sample.4
#1 70.7 88.7 42.3 51.8
#2 91.6 67.6 48.2 57.3
#3 92.6 61.7 64.0 43.3
#4 65.2 70.8 33.3 42.3
#5 80.0 45.3 19.0 37.6
#6 82.1 55.6 41.0 54.9
#7 88.1 64.6 53.1 71.1
#8 92.2 62.7 46.5 33.8
#9 53.3 72.4 30.0 43.1
#10 80.0 46.8 43.4 39.1
#11 60.3 59.0 43.7 63.0
#12 89.7 63.7 92.0 74.0
#13 84.8 67.0 53.0 31.0
#14 94.0 71.6 33.0 48.3
#15 71.8 48.3 48.4 42.9
#16 76.9 55.6 43.2 62.2
#17 91.4 62.5 41.8 35.4
#18 57.9 60.0 62.5 33.8
#19 61.9 72.9 33.3 40.7
#20 71.5 47.4 49.3 41.2
No external packages are used
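The reverse operation in base R is stack, which returns the long values/ind layout (the renaming at the end is just to match the question's column names):

```r
# small version of the example data
df1 <- data.frame(Samples = rep(c("Sample 1", "Sample 2"), each = 2),
                  Content = c(70.7, 91.6, 88.7, 67.6))

wide <- unstack(df1, Content ~ Samples)
long <- stack(wide)                    # columns: values, ind
names(long) <- c("Content", "Samples")
long
```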
If the number of 'Sample' elements is different, then dcast from data.table can be used (it works in both cases):
library(data.table)
dcast(setDT(df1), rowid(Samples)~Samples, value.var = "Content")

Related

Matrices of different size multiplication

Good evening,
In RStudio I have a problem multiplying two matrices of different sizes, and it gets harder because the values in the row where d2$ID = 1 have to multiply only the repetitions where w$sample = 1.
sample and ID indicate the same sample.
In other words, from the "subset" d2$ID = 1, every single value ("L1", "ST", "GR", "CB", "HSK", "DDM") has to multiply the whole "subset" w$sample = 1 (4 rows in this case, but not always), that is, all the values "G2", "G4", "G6", "G8", "G12".
>d2
ID L1 ST GR CB HSK DDM
1 1 0.1662000 0.2337000 0.3637000 0.11110000 0.10100000 0.024300000
2 2 0.1896576 0.2280830 0.3705740 0.09406879 0.09319434 0.024422281
3 3 0.1110259 0.2217769 0.4180797 0.11122498 0.10902635 0.028866094
4 4 0.1558785 0.2008862 0.4222565 0.09805538 0.10218119 0.020742172
5 5 0.1536421 0.1674096 0.4205395 0.14362176 0.08635519 0.028431849
6 6 0.1841964 0.1514189 0.4603306 0.10243621 0.08928011 0.012337688
> w
sample G2 G4 G6 G8 G12
1 1 10.9 15.9 21.4 28.0 37.8
2 1 11.5 16.6 22.2 29.5 38.3
3 1 10.3 15.1 20.7 28.3 36.7
4 1 11.7 18.1 24.8 31.2 39.5
5 2 11.0 16.8 22.4 30.6 38.0
6 2 10.1 15.9 22.5 30.2 36.7
7 2 12.8 17.8 22.8 28.7 37.1
8 2 11.8 16.3 20.8 27.3 34.7
9 2 11.9 16.7 21.6 28.3 34.6
10 3 12.0 18.1 24.2 30.9 40.0
11 3 12.2 17.7 24.2 31.7 40.5
12 4 11.1 16.5 22.7 31.0 39.2
13 4 12.5 19.8 27.4 32.8 38.8
14 4 12.4 19.2 25.8 33.0 39.9
15 4 12.4 19.2 26.2 33.4 38.9
16 4 13.4 18.3 23.7 30.0 38.2
17 5 13.3 18.6 24.0 30.7 38.4
18 5 13.3 18.1 22.9 30.1 36.8
19 5 13.7 19.9 26.5 33.8 43.0
20 5 12.7 18.2 24.6 32.5 41.3
21 6 12.1 17.5 24.3 33.7 42.2
22 6 14.5 20.8 28.4 35.3 43.7
I have already checked a lot of questions but I can't figure it out, especially because most of the information is for matrices of the same size.
I tried filtering the data from d2, but the data set is really big, so that approach is really inefficient.
I am a beginner; if you consider this easy, I would appreciate at least a hint, please!
I have several data sets like these ones...
Thanks in advance!
This seems to perform as requested:
res <- apply(w, 1, function(x) { unclass(
  outer(as.matrix(x[-1]),
        as.matrix(d2[match(x[1], d2$ID),   # match the sample value to the d2 row
                     c("L1", "ST", "GR", "CB", "HSK", "DDM")])))})
str(res)
# result
# num [1:30, 1:22] 1.81 2.64 3.56 4.65 6.28 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:22] "1" "2" "3" "4" ...
I almost got it right on the first pass, but after some debugging found that I needed to add the as.matrix call to both arguments inside outer (so to speak ;-). To explain my logic: I wanted to run down each row of w with apply, then use match on the value of the first column (of each row of w) to find the corresponding row of d2. The match function is designed for just this purpose: it returns a suitable number to be used for indexing. Then, with the rest of the row (x[-1] by the time it was passed through the function call), I used outer to cross the row values with the desired row and columns of d2. If you do it without the as.matrix calls you get an error message:
Error in tcrossprod(x, y) :
requires numeric/complex matrix/vector arguments
I don't think that's a very informative error message. Both of the arguments were numeric vectors.
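For what it's worth, a merge-based sketch avoids apply entirely: expand d2 to one row per row of w by joining on sample/ID, then let R's columnwise recycling produce the per-row outer products. The toy d2 and w below stand in for the question's data (same column names); the output layout differs from the apply version, but each row still holds all 5 x 6 products.

```r
# toy versions of d2 and w (column names as in the question)
d2 <- data.frame(ID = 1:2, L1 = c(2, 1), ST = c(3, 1), GR = 1,
                 CB = 1, HSK = 1, DDM = 1)
w  <- data.frame(sample = c(1, 1, 2), G2 = c(10, 20, 30),
                 G4 = 1, G6 = 1, G8 = 1, G12 = 1)

# join each row of w to its matching d2 row via sample/ID
m <- merge(w, d2, by.x = "sample", by.y = "ID")
g_cols <- c("G2", "G4", "G6", "G8", "G12")
d_cols <- c("L1", "ST", "GR", "CB", "HSK", "DDM")
# for each G column, multiply it into every d2 column, one row of w at a time
res <- do.call(cbind, lapply(g_cols, function(g)
  m[[g]] * as.matrix(m[, d_cols])))
dim(res)  # one row per row of w, 5 * 6 = 30 columns
```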

subtract from column with the maximum values in sequential manner in R

I am still learning how to write loops and if-else statements in R. I can do the process longhand, but I am going to apply it to a large dataset, so I need to process it with loops/if-else.
My data looks a little bit like the sample data frame below. One of the columns contain the column number of the maximum value within the row:
x1 x2 x3 x4 x5 x6 x7 max_index max_val
1 56.1 56.8 99.4 44.6 50.4 74.9 17.7 3 99.4
2 9.1 46.1 74.2 64.3 62.3 68.8 85.7 7 85.7
3 83.3 84.5 18.4 93.2 17.6 69.7 23.4 4 93.2
4 94.0 9.7 46.8 25.0 96.9 69.2 94.8 5 96.9
5 21.5 64.1 89.1 87.7 59.7 88.0 73.5 3 89.1
6 53.0 94.9 87.2 19.6 55.9 48.5 82.9 2 94.9
7 52.2 79.1 20.6 9.9 18.3 21.5 92.5 7 92.5
8 42.5 33.0 36.9 45.0 43.9 7.6 45.3 7 45.3
9 89.3 20.6 41.7 74.8 67.4 21.0 49.1 1 89.3
10 21.2 92.6 86.3 76.3 68.6 44.8 8.8 2 92.6
What I want to do is take the 3 columns succeeding the maximum and subtract each from the one before it, like this:
j1 <- max.col(df[,1:7], "first")
df$max_index <- j1
df$max_val <- df[cbind(1:nrow(df), j1)]
i1 <- j1 + 1
i2 <- i1 + 1
i3 <- i2 + 1
value <- df[cbind(1:nrow(df), j1)]
value1 <- df[cbind(1:nrow(df), i1)]
value2 <- df[cbind(1:nrow(df), i2)]
value3 <- df[cbind(1:nrow(df), i3)]
df$max_val <- value
df$max.up1 <- value1
df$max.up2 <- value2
df$max.up3 <- value3
df_x1 <- df$max_val - df$max.up1
df_x2 <- df$max.up1 - df$max.up2
df_x3 <- df$max.up2 - df$max.up3
After doing that, I would like to know whether all 3 outputs (df_x1, df_x2, df_x3) are positive, and add a column that says TRUE, or FALSE if not.
I would like my final dataframe to look like this:
x1 x2 x3 x4 x5 x6 x7 max_index max_val t.or.f
1 56.1 56.8 99.4 44.6 50.4 74.9 17.7 3 99.4 FALSE
2 9.1 46.1 74.2 64.3 62.3 68.8 85.7 7 85.7 NA
3 83.3 84.5 18.4 93.2 17.6 69.7 23.4 4 93.2 FALSE
4 94.0 9.7 46.8 25.0 96.9 69.2 94.8 5 96.9 NA
5 21.5 64.1 89.1 87.7 59.7 88.0 73.5 3 89.1 FALSE
6 53.0 94.9 87.2 19.6 55.9 48.5 82.9 2 94.9 FALSE
7 52.2 79.1 20.6 9.9 18.3 21.5 92.5 7 92.5 FALSE
8 42.5 33.0 36.9 45.0 43.9 7.6 45.3 7 45.3 FALSE
9 89.3 20.6 41.7 74.8 67.4 21.0 49.1 1 89.3 FALSE
10 21.2 92.6 86.3 76.3 68.6 44.8 8.8 2 92.6 TRUE
How will I simplify my code? Thanks!
Here is a data.table solution using a structured-data approach:
library(data.table)
dt.m <- read.table(text = "
x1 x2 x3 x4 x5 x6 x7 max_index max_val
1 56.1 56.8 99.4 44.6 50.4 74.9 17.7 3 99.4
2 9.1 46.1 74.2 64.3 62.3 68.8 85.7 7 85.7
3 83.3 84.5 18.4 93.2 17.6 69.7 23.4 4 93.2
4 94.0 9.7 46.8 25.0 96.9 69.2 94.8 5 96.9
5 21.5 64.1 89.1 87.7 59.7 88.0 73.5 3 89.1
6 53.0 94.9 87.2 19.6 55.9 48.5 82.9 2 94.9
7 52.2 79.1 20.6 9.9 18.3 21.5 92.5 7 92.5
8 42.5 33.0 36.9 45.0 43.9 7.6 45.3 7 45.3
9 89.3 20.6 41.7 74.8 67.4 21.0 49.1 1 89.3
10 21.2 92.6 86.3 76.3 68.6 44.8 8.8 2 92.6", header = TRUE)
dt.m <- data.table(dt.m)
dt.m[, row.id := 1:.N]
# melt data to make it easy to work with, excluding max.val and max.index
dt <- melt(data = dt.m, measure.vars = 1:7, id.vars = "row.id")
# replicate max.val and max.index which are already provided in example
dt[, max.val := max(value), by = row.id]
dt[, max.index := which.max(value), by = row.id] # first max; robust to ties
dt[, x.index := 1:.N, by = row.id]
# filter to values after the max value
out <- dt[x.index >= max.index]
# keep max value and 3 values post max value
out <- out[, post.max.index := 1:.N, by = row.id][post.max.index <= 4]
out <- out[order(row.id, x.index)]
out[, previous.x := shift(value)]
out[, change.x := previous.x - value]
out <- out[max.index != x.index]
# check if all values are positive
res <- out[, .(all.next.positive = all(change.x > 0)), by = row.id]
# add result to the original data
dt.m <- merge(dt.m, res, by = "row.id", all.x = TRUE)
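For comparison, a vectorized base-R sketch of the same logic, returning NA whenever fewer than three columns follow the maximum (matching the expected output in the question). The toy rows below exercise the three cases:

```r
# toy rows: max mid-row (TRUE), max in last column (NA),
# max first but not strictly decreasing after it (FALSE)
df <- data.frame(x1 = c(1, 0, 9), x2 = c(5, 0, 1), x3 = c(4, 0, 5),
                 x4 = c(3, 0, 0), x5 = c(2, 0, 0), x6 = c(1, 0, 0),
                 x7 = c(0, 9, 0))

j1 <- max.col(df[, 1:7], "first")
m  <- as.matrix(df[, 1:7])
# value `off` columns after the row maximum (NA once past column 7)
val_at <- function(off) {
  idx <- j1 + off
  out <- m[cbind(seq_len(nrow(m)), pmin(idx, 7))]
  out[idx > 7] <- NA
  out
}
vals  <- sapply(0:3, val_at)        # max, max+1, max+2, max+3
diffs <- vals[, 1:3] - vals[, 2:4]  # the three successive differences
df$t.or.f <- ifelse(rowSums(is.na(diffs)) > 0, NA, rowSums(diffs > 0) == 3)
df$t.or.f
```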

Adding summary columns programmatically

I have dataframe X01 whose columns I should summarize with mean, max and min
> head(X01)
B01002e2 B01002e3
1 39.6 47.3
2 37.0 44.8
3 52.6 49.8
4 35.5 26.7
5 39.4 23.9
6 40.8 39.8
My objective is to add min, max, and mean following each column. So far, I have done this manually by rearranging column order, but I will soon have data with many columns which makes this approach very slow:
X01$B01002e2_min <- min(X01$B01002e2, na.rm = TRUE)
X01$B01002e2_max <- max(X01$B01002e2, na.rm = TRUE)
X01$B01002e2_mean <- mean(X01$B01002e2, na.rm = TRUE)
X01$B01002e3_min <- min(X01$B01002e3, na.rm = TRUE)
X01$B01002e3_max <- max(X01$B01002e3, na.rm = TRUE)
X01$B01002e3_mean <- mean(X01$B01002e3, na.rm = TRUE)
X01 <- X01[ , c(1,3,4,5,2,6,7,8)]
> head(X01)
B01002e2 B01002e2_min B01002e2_max B01002e2_mean B01002e3 B01002e3_min B01002e3_max
1 39.6 6 83.7 35.3427547 47.3 8.9 90.8
2 37.0 6 83.7 35.3427547 44.8 8.9 90.8
3 52.6 6 83.7 35.3427547 49.8 8.9 90.8
4 35.5 6 83.7 35.3427547 26.7 8.9 90.8
5 39.4 6 83.7 35.3427547 23.9 8.9 90.8
6 40.8 6 83.7 35.3427547 39.8 8.9 90.8
B01002e3_mean
1 37.6894248
2 37.6894248
3 37.6894248
4 37.6894248
5 37.6894248
6 37.6894248
Is there a way in R to add these columns after each processed column in one step, for example with addmargins()?
dput(head(X01))
structure(list(B01002e2 = c(39.6, 37, 52.6, 35.5, 39.4, 40.8),
B01002e3 = c(47.3, 44.8, 49.8, 26.7, 23.9, 39.8)), .Names = c("B01002e2",
"B01002e3"), row.names = c(NA, 6L), class = "data.frame")
Here's a dplyr approach:
library(dplyr)
X01 %>% mutate_all(funs(max, mean, min))
B01002e2 B01002e3 B01002e2_max B01002e3_max B01002e2_mean B01002e3_mean B01002e2_min B01002e3_min
1 39.6 47.3 52.6 49.8 40.81667 38.71667 35.5 23.9
2 37.0 44.8 52.6 49.8 40.81667 38.71667 35.5 23.9
3 52.6 49.8 52.6 49.8 40.81667 38.71667 35.5 23.9
4 35.5 26.7 52.6 49.8 40.81667 38.71667 35.5 23.9
5 39.4 23.9 52.6 49.8 40.81667 38.71667 35.5 23.9
6 40.8 39.8 52.6 49.8 40.81667 38.71667 35.5 23.9
If you want to ignore NA then you can add na.rm=TRUE:
X01[3,1] = NA
X01 %>% mutate_all(funs(max, mean, min), na.rm=TRUE)
B01002e2 B01002e3 B01002e2_max B01002e3_max B01002e2_mean B01002e3_mean B01002e2_min B01002e3_min
1 39.6 47.3 40.8 49.8 38.46 38.71667 35.5 23.9
2 37.0 44.8 40.8 49.8 38.46 38.71667 35.5 23.9
3 NA 49.8 40.8 49.8 38.46 38.71667 35.5 23.9
4 35.5 26.7 40.8 49.8 38.46 38.71667 35.5 23.9
5 39.4 23.9 40.8 49.8 38.46 38.71667 35.5 23.9
6 40.8 39.8 40.8 49.8 38.46 38.71667 35.5 23.9
If you just want the summary values as a new data frame, you can do this:
X01 %>% summarise_all(funs(max, mean, min), na.rm=TRUE)
B01002e2_max B01002e3_max B01002e2_mean B01002e3_mean B01002e2_min B01002e3_min
1 40.8 49.8 38.46 38.71667 35.5 23.9
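Note that funs() was deprecated in dplyr 0.8; with current dplyr the same result is spelled with across() and lambda functions (shown on a small version of the example data):

```r
library(dplyr)

X01 <- data.frame(B01002e2 = c(39.6, 37.0, 52.6),
                  B01002e3 = c(47.3, 44.8, 49.8))

res <- X01 %>%
  mutate(across(everything(),
                list(min  = ~min(.x, na.rm = TRUE),
                     max  = ~max(.x, na.rm = TRUE),
                     mean = ~mean(.x, na.rm = TRUE))))
res
```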
Here's an attempt using a functional approach to loop over each column and function:
funs <- c("min","max","mean")
cbind(
dat,
unlist(Map(function(f,d) lapply(d,f), mget(funs, inherits=TRUE), list(dat) ), rec=FALSE)
)
# B01002e2 B01002e3 min.B01002e2 min.B01002e3 max.B01002e2 max.B01002e3 mean.B01002e2 mean.B01002e3
#1 39.6 47.3 35.5 23.9 52.6 49.8 40.81667 38.71667
#2 37.0 44.8 35.5 23.9 52.6 49.8 40.81667 38.71667
#3 52.6 49.8 35.5 23.9 52.6 49.8 40.81667 38.71667
#4 35.5 26.7 35.5 23.9 52.6 49.8 40.81667 38.71667
#5 39.4 23.9 35.5 23.9 52.6 49.8 40.81667 38.71667
#6 40.8 39.8 35.5 23.9 52.6 49.8 40.81667 38.71667

Inserting another column to a data frame and incrementing its value per row

I have this data frame:
head(df,10)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
3 36.4 13.1 13.9 36.6 9.26 57.9 28.0 34.96 26049 3492
4 31.1 11.2 12.6 45.1 7.81 48.8 25.9 37.85 17515 2754
5 33.2 13.4 13.2 40.3 8.69 54.3 26.9 35.67 23510 3265
6 34.0 12.8 13.7 39.4 8.77 54.8 26.5 35.19 25151 3305
7 32.7 12.4 13.6 41.3 8.49 53.0 25.9 35.97 25214 3201
8 33.4 13.7 12.5 40.3 8.76 54.7 27.1 36.50 23943 3391
9 35.2 13.8 13.5 37.5 9.20 57.5 27.8 33.08 25647 3385
10 34.6 14.9 14.9 35.6 9.35 58.4 27.8 35.81 27324 3790
11 30.4 13.3 13.0 43.3 8.29 51.8 24.9 38.31 25178 2881
12 32.0 13.3 14.0 40.7 8.58 53.6 26.1 35.97 25677 3162
I have a DateTime value like this:
DateTime<-Sys.time()
I would like to insert another column into this df, incrementing the DateTime value by 30 seconds for each row.
I'm doing this:
for (i in 1:nrow(df)) {
df[1,]$DateTime<-DateTime
DateTime<-DateTime+30
}
This loop is not doing what I'm trying to do. Any help is greatly appreciated.
df$DateTime <- Sys.time() + 30 * (seq_len(nrow(df))-1)
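Equivalently, seq() on a POSIXct start time builds the whole 30-second sequence in one call (a stand-in df is used here for illustration):

```r
df <- data.frame(x = 1:4)  # stand-in for the question's data frame
# seq() on a POSIXct dispatches to seq.POSIXt; numeric `by` is in seconds
df$DateTime <- seq(from = Sys.time(), by = 30, length.out = nrow(df))
df$DateTime
```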

R package for calculating partial coefficient of determination? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Does anyone know of an R package for calculating partial R^2 in multiple regression? I've tried the command partial.R2 from the asbio package, but it gives error messages even with the example from the supplied documentation.
Many thanks.
I've found that the command lm.sumSquares from the lmSupport package provides partial and semipartial correlations.
Data from 'Applied Linear Statistical Models' by John Neter, Michael H. Kutner, William Wasserman, and Christopher J. Nachtsheim, Section 7.4, page 274:
# body fat example from Neter et al. via rhelp archives:
bf.dat <- read.table(text="x1 x2 x3 y
1 19.5 43.1 29.1 11.9
2 24.7 49.8 28.2 22.8
3 30.7 51.9 37.0 18.7
4 29.8 54.3 31.1 20.1
5 19.1 42.2 30.9 12.9
6 25.6 53.9 23.7 21.7
7 31.4 58.5 27.6 27.1
8 27.9 52.1 30.6 25.4
9 22.1 49.9 23.2 21.3
10 25.5 53.5 24.8 19.3
11 31.1 56.6 30.0 25.4
12 30.4 56.7 28.3 27.2
13 18.7 46.5 23.0 11.7
14 19.7 44.2 28.6 17.8
15 14.6 42.7 21.3 12.8
16 29.5 54.4 30.1 23.9
17 27.7 55.3 25.7 22.6
18 30.2 58.6 24.6 25.4
19 22.7 48.2 27.1 14.8
20 25.2 51.0 27.5 21.1 ", header=TRUE)
library(rms) # will also load Hmisc
fit <- ols(y ~ x1 + x2, data=bf.dat)
plt <- plot(anova(fit), what='partial R2')
plt
# x2 x1
#0.066955220 0.007010427
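The textbook coefficient of partial determination can also be computed by hand as the proportional drop in SSE when a predictor is added; note that packages differ in which denominator they use, so this need not match the rms numbers above. Synthetic data is used here for illustration (any data frame with y, x1, x2 works):

```r
# synthetic data for illustration
set.seed(42)
bf <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
bf$y <- bf$x1 + bf$x2 + rnorm(50)

full    <- lm(y ~ x1 + x2, data = bf)
reduced <- lm(y ~ x2, data = bf)
# partial R^2 of x1 given x2: proportional drop in SSE when x1 is added
pr2 <- (sum(residuals(reduced)^2) - sum(residuals(full)^2)) /
        sum(residuals(reduced)^2)
pr2
```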