My goal is to perform multiple column operations in one line of code without hard-coding the variable names. My data looks like this (dput() output):
structure(list(Subject = 1:6, Congruent_1 = c(359, 391, 384,
316, 287, 403), Congruent_2 = c(361, 378, 322, 286, 276, 363),
Congruent_3 = c(342, 355, 334, 274, 297, 335), Congruent_4 = c(365,
503, 324, 256, 266, 388), Congruent_5 = c(335, 354, 320,
272, 260, 337), Incongruent_1 = c(336, 390, 402, 305, 310,
400), Incongruent_2 = c(366, 407, 386, 280, 243, 393), Incongruent_3 = c(323,
455, 317, 308, 259, 325), Incongruent_4 = c(361, 392, 357,
274, 342, 350), Incongruent_5 = c(300, 366, 378, 263, 258,
349)), row.names = c(NA, 6L), class = "data.frame")
I need to do column subtraction and save the new values into new columns. For example, a new column named selhistB_1 should be computed as Incongruent_1 - Congruent_1. I tried to write a for loop that indexes the existing columns by their names and creates new columns using the same indexing variable:
for(i in 1:5)(
DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) = as.name(paste("Incongruent_",i,sep="")) - as.name(paste("Congruent_",i,sep="")))
)
but I received this error:
Error: unexpected '=' in: "for(i in 1:5)( DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) ="
I'd rather use this modular approach, as opposed to hard-coding and writing out "selhistB_1 = Incongruent_1 - Congruent_1" five times with the mutate() function.
I also wonder whether I could achieve the same goal on the long version of this data, and maybe that would make more sense.
library(dplyr)
library(tidyr)   # pivot_longer() and pivot_wider() come from tidyr, not dplyr
d %>%
  pivot_longer(-Subject,
               names_to = c(".value", "id"),
               names_sep = "_") %>%
  mutate(selhistB = Incongruent - Congruent) %>%
  pivot_wider(names_from = id, values_from = c(Congruent, Incongruent, selhistB))
Or just skip the last pivot, and keep everything long.
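For instance, keeping the long shape makes per-subject summaries straightforward; a minimal sketch, assuming d holds the data frame from the question:
library(dplyr)
library(tidyr)
d_long <- d %>%
  pivot_longer(-Subject,
               names_to = c(".value", "id"),
               names_sep = "_") %>%
  mutate(selhistB = Incongruent - Congruent)
# e.g. the mean Incongruent-minus-Congruent difference per subject
d_long %>%
  group_by(Subject) %>%
  summarise(mean_selhistB = mean(selhistB))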
As long as you are already using tidyverse packages, the following code will do exactly what you need:
library(dplyr)
library(rlang)   # sym() builds a symbol from a string; !! unquotes it (the modern spelling of UQ())
for(i in 1:5){
  DP4 <- DP4 %>%
    mutate(!!sym(paste0("selhistB_", i)) :=
             !!sym(paste0("Incongruent_", i)) - !!sym(paste0("Congruent_", i)))
}
DP4
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
  selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1        -23          5        -19         -4        -35
2         -1         29        100       -111         12
3         18         64        -17         33         58
4        -11         -6         34         18         -9
5         23        -33        -38         76         -2
6         -3         30        -10        -38         12
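With dplyr 1.0 or later you can avoid both the loop and the quasiquotation using across(). A sketch under that assumption: cur_column() identifies the Incongruent column being processed and get() fetches its Congruent partner from the data mask:
library(dplyr)
DP4 <- DP4 %>%
  mutate(across(starts_with("Incongruent_"),
                ~ .x - get(sub("Incongruent", "Congruent", cur_column())),
                .names = "selhistB_{.col}")) %>%
  rename_with(~ sub("selhistB_Incongruent_", "selhistB_", .x))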
You can use split.default and split on the column-name suffix, then loop over the list and subtract the first column (Congruent) from the second (Incongruent), i.e.
lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))), function(i) i[2] - i[1])
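To attach those differences to the original data with the requested names, one possible follow-up (a sketch, assuming the question's data frame is df):
out <- lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))),
              function(i) i[2] - i[1])
selhist <- setNames(do.call(cbind, out), paste0("selhistB_", names(out)))
df <- cbind(df, selhist)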
Subtracting all the matching columns at once, then binding the result with cbind, try:
x <- df1[, grepl("^I", colnames(df1))] - df1[, grepl("^C", colnames(df1))]
names(x) <- paste0("selhistB_", seq_along(names(x)))
res <- cbind(df1, x)
res
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
  selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1        -23          5        -19         -4        -35
2         -1         29        100       -111         12
3         18         64        -17         33         58
4        -11         -6         34         18         -9
5         23        -33        -38         76         -2
6         -3         30        -10        -38         12
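This relies on the Congruent_i and Incongruent_i columns appearing in the same i order on both sides of the subtraction. If that is not guaranteed, sorting the names first makes the pairing explicit (a defensive sketch):
ic <- sort(grep("^Incongruent_", names(df1), value = TRUE))
cc <- sort(grep("^Congruent_", names(df1), value = TRUE))
x <- df1[ic] - df1[cc]
names(x) <- paste0("selhistB_", seq_along(x))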
I have a large dataset consisting of a header and a series of values in each column. I want to detect the presence and number of duplicates of these values within the whole dataset.
1 2 3 4 5 6 7
734 456 346 545 874 734 455
734 783 482 545 456 948 483
So, for example, it would detect 734 three times, 456 twice, etc.
I've tried using the duplicated function in R, but it seems to work only on whole rows or whole columns. Using
duplicated(df)
doesn't pick up any duplicates, though I know 734 appears twice in the first row.
So I'm asking how to detect duplicates both within and across columns/rows.
Cheers
You can use table() and data.frame() to see the occurrence counts:
data.frame(table(v))
which gives
v Freq
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 346 1
9 455 1
10 456 2
11 482 1
12 483 1
13 545 2
14 734 3
15 783 1
16 874 1
17 948 1
DATA
v <- c(1, 2, 3, 4, 5, 6, 7, 734, 456, 346, 545, 874, 734, 455, 734,
783, 482, 545, 456, 948, 483)
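If the values sit in a data frame rather than a vector, you can build v directly before tabulating (a sketch, assuming the question's table has been read into a data frame df):
v <- unlist(df, use.names = FALSE)
data.frame(table(v))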
You can transform it to a vector and then use table() as follows:
library(data.table)
library(dplyr)
df<-fread("734 456 346 545 874 734 455
734 783 482 545 456 948 483")
df %>% unlist() %>% table()
# 346 455 456 482 483 545 734 783 874 948
# 1 1 2 1 1 2 3 1 1 1
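Since the question asks specifically about duplicates, you can then keep only the values that occur more than once:
tab <- df %>% unlist() %>% table()
tab[tab > 1]
# 456 545 734
#   2   2   3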
I am trying to reproduce "manually" the example in this Wikipedia post using R.
Here is the data:
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
after before sgn abs
1 125 110 1 15
2 115 122 -1 7
3 130 125 1 5
4 140 120 1 20
5 140 140 0 0
6 115 124 -1 9
7 140 123 1 17
8 125 137 -1 12
9 140 135 1 5
10 135 145 -1 10
If I try to rank the rows based on the abs column, the 0 entry is naturally ranked as 1:
rank = rank(abs)
(d = data.frame(after,before,sgn,abs,rank))
after before sgn abs rank
1 125 110 1 15 8.0
2 115 122 -1 7 4.0
3 130 125 1 5 2.5
4 140 120 1 20 10.0
5 140 140 0 0 1.0
6 115 124 -1 9 5.0
7 140 123 1 17 9.0
8 125 137 -1 12 7.0
9 140 135 1 5 2.5
10 135 145 -1 10 6.0
However, zeros are ignored in the Wilcoxon signed-rank test.
How can I get R to ignore that row, so as to end up with:
after before sgn abs rank
1 125 110 1 15 7.0
2 115 122 -1 7 3.0
3 130 125 1 5 1.5
4 140 120 1 20 9.0
5 140 140 0 0 0
6 115 124 -1 9 4.0
7 140 123 1 17 8.0
8 125 137 -1 12 6.0
9 140 135 1 5 1.5
10 135 145 -1 10 5.0
SOLUTION (accepted answer below):
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
d$rank = rank(replace(abs, abs == 0, NA), na.last = "keep")
d$multi = d$sgn * d$rank
(W = abs(sum(d$multi, na.rm = TRUE)))
[1] 9
From the Wikipedia article:
Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
We need to exclude zeroes. By my thinking, you should replace the zeroes with NA and then tell rank() to exclude NAs from the ranking. Since you need to return a vector of the same length as the input, pass na.last = "keep" so the NAs keep their positions in the result:
d$rank <- rank(replace(abs, abs == 0, NA), na.last = "keep")
d
## after before sgn abs rank
## 1 125 110 1 15 7.0
## 2 115 122 -1 7 3.0
## 3 130 125 1 5 1.5
## 4 140 120 1 20 9.0
## 5 140 140 0 0 NA
## 6 115 124 -1 9 4.0
## 7 140 123 1 17 8.0
## 8 125 137 -1 12 6.0
## 9 140 135 1 5 1.5
## 10 135 145 -1 10 5.0
The subtraction-based solutions will not work if the input vector contains no zeroes or more than one zero.
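As a cross-check, base R's wilcox.test() drops the zero pair automatically (with warnings about zeroes and ties). It reports V, the sum of the positive ranks, which is 27 here; the article's W follows as W = 2V - Nr(Nr+1)/2 = 54 - 45 = 9. A quick sketch using the after/before vectors from the question:
wilcox.test(after, before, paired = TRUE)
# V = 27 (plus warnings about zeroes and ties)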
You could create the new column and then just update the rank where the abs value isn't 0:
d$rank <- 0 # default value for rows with abs == 0
d$rank[d$abs != 0] <- rank(d$abs[d$abs != 0])
If you wanted to drop the row completely, you could just do
transform(subset(d, abs!=0), rank=rank(abs))
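A quick sanity check of the first variant against the target table (reusing d from the question):
d$rank <- 0
d$rank[d$abs != 0] <- rank(d$abs[d$abs != 0])
d$rank
# [1] 7.0 3.0 1.5 9.0 0.0 4.0 8.0 6.0 1.5 5.0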
A quick way would be to rank as normal and then do:
d$rank <- ifelse(d$rank == 1, 0, d$rank - 1)
This switches the rank of 1 to 0 and reduces every other rank by 1. Note that it assumes exactly one zero difference: with no zeroes, or several tied at zero, the shifted ranks will be wrong.
I have some data:
h = [157 144 80 106 124 46 207 188 190 208 143 170 162 178 155 163 162 149 135 160 149 147 133 146 126 120 151 74 122 145 160 155 173 126 172 93];
Then I get the indices of the local maxima with the diff function:
max = [1 5 7 10 12 14 16 20 24 27 31 33 35 36]
I have simple MATLAB code for the spline interpolation (N is the length of h):
maxenv = spline(max,h(max),1:N);
This code gives the result:
maxenv = 157 86.5643152828762 67.5352696350679 84.9885891697257 124 169.645228239041 207 224.396380746179 223.191793341491 208 185.421032390413 170 172.173624690130 178 172.759468849065 163 158.147870987344 157.874968589889 159.414581897490 160 157.622863516083 153.308219179638 148.839465253375 146 146.051320982064 148.167322961480 151 153.474200222188 155.606188003845 157.685081783579 160 163.653263154551 173 186.027639098700 172 93
Now I'm trying the same in R:
maxenv <- spline(max, h(max), seq(N))
but I get this error:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
Your code and question are not fully clear, so I'm not 100% sure this is what you are looking for:
# vector of values
h <- c(157, 144, 80, 106, 124, 46, 207, 188, 190, 208, 143, 170, 162, 178, 155,
163, 162, 149, 135, 160, 149, 147, 133, 146, 126, 120, 151, 74, 122, 145,
160, 155, 173, 126, 172, 93)
# indices of the local maxima, plus the two endpoints (as in the MATLAB max vector)
idx <- c(1, which(diff(sign(diff(h))) == -2) + 1, length(h))
# spline through those points, evaluated at every original index (MATLAB's 1:N)
spl <- spline(idx, h[idx], xout = seq_along(h))
# visual inspection
plot(h)
lines(spl)
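With xout = seq_along(h), spl$y should reproduce MATLAB's maxenv at the interior points; values near the ends can differ slightly because R's default method "fmm" does not use MATLAB's not-a-knot end conditions. If the boundary behaviour matters, method = "natural" is another option:
# compare end conditions; both are cubic splines through the same maxima
spl_fmm <- spline(idx, h[idx], xout = seq_along(h))
spl_nat <- spline(idx, h[idx], xout = seq_along(h), method = "natural")
head(cbind(fmm = spl_fmm$y, natural = spl_nat$y))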
I have a table produced by calling table(...) on a column of data, and I get a table that looks like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
346 351 341 333 345 415 421 425 429 437 436 469 379 424 387 419 392 396 381 421
I'd like to draw a boxplot of these frequencies, but calling boxplot on the table results in an error:
Error in Axis.table(x = c(333, 368.5, 409.5, 427, 469), side = 2) :
only for 1-D table
I've tried coercing the table to an array with as.array but it seems to make no difference. What am I doing wrong?
If I understand you correctly, boxplot(c(tab)) or boxplot(as.vector(tab)) should work (credit to @joran as well).
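c() (like as.vector()) strips the "table" class, leaving a plain named vector that boxplot() accepts. A self-contained sketch with made-up data standing in for your real column:
set.seed(1)
x <- sample(0:19, 8000, replace = TRUE)  # stand-in for the real data column
tab <- table(x)                          # 1-D table of frequencies
boxplot(c(tab))                          # one box summarising the 20 counts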