I've downloaded a table from wikipedia and in some columns there are links next to numbers. Is this possible to delete it ?
In column in Rstudio it looks like this:
402[38]
[38] - this is what I don't want.
We can do this easily in base R with Regex:
a <- data.frame(V1 = paste0(1:20, sprintf("[%s]", 50:70))
a$V2 <- gsub("\\[.*?\\]","", a$V1)
V1 V2
1 1[50] 1
2 2[51] 2
3 3[52] 3
4 4[53] 4
5 5[54] 5
6 6[55] 6
7 7[56] 7
8 8[57] 8
9 9[58] 9
10 10[59] 10
11 11[60] 11
12 12[61] 12
13 13[62] 13
14 14[63] 14
15 15[64] 15
16 16[65] 16
17 17[66] 17
18 18[67] 18
19 19[68] 19
20 20[69] 20
21 1[70] 1
And this conveniently works for the case of multiple references as well:
a <- data.frame(V1 = paste0(1:20, sprintf("[%s][%s]", 50:70, 80:100)))
Related
How to write an R-script to initialize a vector with integers, rearrange the elements by interleaving the
first half elements with the second half elements and store in the same vector without using pre-defined function and display the updated vector.
This sounds like a homework question, and it would be nice to see some effort on your own part, but it's pretty straightforward to do this in R.
Suppose your vector looks like this:
vec <- 1:20
vec
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Then you can just do:
c(t(cbind(vec[1:10], vec[11:20])))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
This works by joining the two vectors into a 10 x 2 matrix, then transposing that matrix and turning it into a vector.
We may use matrix directly and concatenate
c(matrix(vec, nrow = 2, byrow = TRUE))
-output
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
data
vec <- 1:20
Or using mapply:
vec <- 1:20
c(mapply(\(x,y) c(x,y), vec[1:10], vec[11:20]))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
We can try this using order + %%
> vec[order((seq_along(vec) - 1) %% (length(vec) / 2))]
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
Another way is to use rbind on the 2 halves of the vector, which creates a matrix with two rows. Then, we can then turn the matrix into a vector, which will go through column by column (i.e., 1, 11, 2, 12...). However, this will only work for even vectors.
vec <- 1:20
c(rbind(vec[1:10], vec[11:20]))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
So, for uneven vectors, we can use order, which will return the indices of the numbers in the two seq_along vectors.
vec2 <- 1:21
order(c(seq_along(vec2[1:10]),seq_along(vec2[11:21])))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20 21
The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a balanced partition of one variable.
The code below creates a balanced partition of the dataset according to the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(df$subject, p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents for six months, that is I want to sum 2017M01 to 2017M06 to one number, 2017M07 to 2017M12 to another number and so on.
I'm able to do this by indexing but I want to be able to write: "From 2017M01 to 2017M06 sum contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable based on the number of rows and number of elements to group. For your case, you want to group every 6 rows so your data frame should be divisible with 6. Using iris to demonstrate (It has 150 rows, so 150 / 6 = 25)
rep(seq(nrow(iris)%/%6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
There are plenty of ways to handle how you want to call it. Here is a custom function that allows you to do that (i.e. create the grouping variable),
f1 <- function(x, df) {
v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
i1 <- (v2 - v1) + 1
return(rep(seq(nrow(df)%/%i1), each = i1))
}
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: We can easily make the function compatible with 'non-0-remainder' divisions by concatenating the final result with a repetition of the max+1 value of the final result of remainder times, i.e.
f1 <- function(x, df) {
v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
i1 <- (v2 - v1) + 1
final_v <- rep(seq(nrow(df) %/% i1), each = i1)
if (nrow(df) %% i1 == 0) {
return(final_v)
} else {
remainder = nrow(df) %% i1
final_v1 <- c(final_v, rep((max(final_v) + 1), remainder))
return(final_v1)
}
}
So for a data frame with 20 rows, doing groups of 6, the above function will yield the result:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
Because of this I want to extract a sample of n pairs of cases that are grouped together by variable y
and n pairs of cases that are not grouped together by variable y. In order to calculate the number of
false positives and false negatives (either falsly grouped or not). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n=6) :
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from subset x containing >1 rows; rbinding gives the grouped df.grouped. We then draw nrow(df.grouped) samples from df to produce the ungrouped df.ungrouped.
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
I have a dataset and I want to display the numbers for each row between col1 and col2 counted by col3 using R:
dataset=data.frame(col1=c(3,9,15), col2=c(4,11,16), col3=c(2,3,2))
My result should look like:
3
3
4
4
9
9
9
10
10
10
11
11
11
15
15
16
16
Seems trivial but I cannot get a for loop work. Thanks.
Or this can be done with apply
unlist(apply(dataset, 1, function(x) rep(x[1]:x[2],
each=x[3])))
#[1] 3 3 4 4 9 9 9 10 10 10 11 11 11 15 15 16 16
Try this:
col1=c(3,9,15)
col2=c(4,11,16)
col3=c(2,3,2)
res = NULL
for (k in 1:length(col1)){
res = c(res, sort(rep(col1[k]:col2[k],col3[k])))
}
Result:
> res
[1] 3 3 4 4 9 9 9 10 10 10 11 11 11 15 15 16 16