Changing the length of an object - r

If alpha is an object of length 10, then
alpha <- alpha[2 * 1:5]
makes it an object of length 5 consisting of just the former components with even index.
How does this work?
Also, when I run the code, the entire object contains only NA. Is there any way of retaining the original values?
I added elements and it still showed NA.

Hard to give a proper answer without knowing the content of alpha and what exactly you're trying to accomplish, but hope this helps.
Square brackets are used for indexing:
alpha <- seq(10, 50, 10)
> alpha
[1] 10 20 30 40 50
> alpha[2]
[1] 20
If there's nothing in a position (e.g. if that position doesn't exist in the vector), it will return "Not Available" (NA):
> alpha[6]
[1] NA
> alpha[3:7]
[1] 30 40 50 NA NA
If you want to add new values to the vector, you specify the position(s) and assign the value(s):
alpha[8:12] <- 8:12
> alpha
[1] 10 20 30 40 50 NA NA 8 9 10 11 12
# the positions with no values assigned are filled with NA
If you want to perform an operation on only some positions, you select those positions and apply the operation to them:
> 2*alpha[4:8]
[1] 80 100 NA NA 16
# positions 4, 5, 6, 7, and 8 multiplied by 2
This is different from using an operation to select positions:
> alpha[2*4:8]
[1] 8 10 12 NA NA
# showing the content of positions 8, 10, 12, 14 and 16
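Applied to the line in the question, the same rule explains the result. The sketch below uses a hypothetical 10-element vector (101:110), since the question does not show the real contents of alpha. It also shows one possible cause of the NA values, which is only an assumption: if alpha has fewer than 10 elements when the line runs (for example because the line was already run once), the missing positions come back as NA.

```r
# Hypothetical 10-element vector standing in for the asker's alpha
alpha <- 101:110

# ':' binds tighter than '*', so 2 * 1:5 means 2 * (1:5)
idx <- 2 * 1:5            # 2 4 6 8 10

# Keep only the elements at the even positions
alpha <- alpha[idx]       # 102 104 106 108 110
length(alpha)             # 5

# Running the same line again asks for positions 2, 4, 6, 8 and 10
# of a vector that now has only 5 elements, so the last three are NA
again <- alpha[2 * 1:5]   # 104 108 NA NA NA
```

To retain the original values, assign the result to a new name (as done with again above) instead of overwriting alpha.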

Related

Why is the for loop returning NA vectors in some positions (in R)?

Following a YouTube tutorial, I have created a vector x containing -3, 6, 2, 5, 9.
Then I created an empty variable 'Storage2' of length 5 with the function 'numeric(5)'.
I want to store the squares of my vector x in 'Storage2' with a for loop.
When I do the for loop and update my variable, it returns a very strange thing:
[1] 9 4 0 9 25 36 NA NA 81
I can see all numbers in x have been squared, but the order is so random, and there's more than 5.
Also, why are there NAs?? If it's because the last number of x is 9 (and so this number defines the length??), and there's no 7 and 8 position, I would understand, but then I'm also missing positions 1, 3 and 4, so there should be more NAs...
I'm just starting with R, so please keep it simple, and correct me if I'm wrong during my thought process! Thank you!!
x <- c(-3,6,2,5,9)
Storage2 <- numeric(5)
for(i in x){
Storage2[i] <- i^2
}
Storage2
# [1] 9 4 0 9 25 36 NA NA 81
You're looping over the elements of x, not over the positions as probably intended. You need to change your loop like so:
for(i in 1:length(x)) {
Storage2[i] <- x[i]^2
}
Storage2
# [1] 9 36 4 25 81
(Note: 1:length(x) can also be expressed as seq_along(x), as pointed out by @NelsonGon in the comments; it might be faster, and it also behaves correctly when x is empty.)
However, R is a vectorized language, so you can simply do this:
Storage2 <- x^2
Storage2
# [1] 9 36 4 25 81
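To see where the strange output of the original loop comes from, it can help to print Storage2 after every iteration. The following is a small tracing sketch built from the code in the question; only the cat() call and the name fixed are additions.

```r
x <- c(-3, 6, 2, 5, 9)
Storage2 <- numeric(5)

for (i in x) {
  Storage2[i] <- i^2          # i is a VALUE of x, used here as a position
  cat("i =", i, "->", Storage2, "\n")
}
# i = -3: a negative index means "every position except 3",
#         so positions 1, 2, 4 and 5 all become 9
# i = 6 and i = 9: assigning past the end grows the vector;
#         the skipped positions 7 and 8 are filled with NA

# Looping over positions instead keeps the result at length 5
fixed <- numeric(length(x))
for (i in seq_along(x)) {
  fixed[i] <- x[i]^2
}
fixed
```

This is why the result has nine elements and why only positions 7 and 8 are NA: positions 1, 3 and 4 already held values from numeric(5) and from the first iteration.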

What type does table return in R?

I wrote the lines of code below.
I want to get the most frequent value in a matrix:
matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)
t <- table(matrix7)
print(t)
a <- which.max(table(matrix7))
print(unlist(a))
It prints this:
> matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)
> t <- table(matrix7)
> print(t)
matrix7
 1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 34 35 36
 4  5  1  5  2  5  1  3  1  4  2  2  2  5  5  1  3  7  2  3  2  3  2  1  4  4  2  2  2  5  2  5  3
> a <- which.max(table(matrix7))
> print(unlist(a))
19
18
>
What type are my t and a variables,
and how can I get the most frequent value from the matrix?
To find out the "type" of a variable, use:
class(t)
class(a)
Note that t is already a table, because you created it with t <- table(matrix7), while your variable a is an integer.
To see the most common elements of your matrix, sort the table of its values in decreasing order; the most frequent value comes first:
sort(table(as.vector(matrix7)), decreasing = TRUE)
In general, if you want to know the "type" (more properly called the class) of an object, use the function class:
> class(t)
[1] "table"
There are a few ways you can find the most frequent value. Given that you have already calculated the which.max, you can take the corresponding name of t:
> as.numeric(names(t)[a])
[1] 5 ## I have a different random number seed to you :)
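Note that you can't just take t[a], since that returns the count stored in that cell of the table (7 in your example) rather than the value it belongs to. Likewise, a on its own is the position within the table (18 in your example), which differs from the value (19) whenever some numbers are missing from the table.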
In your example, the object a is an integer vector of length one. The "data" is 18, and it has the "name" 19. Hence another and perhaps simpler way to get the most frequent value is to take names(a).
You can either use class() to get the class attribute of an R object or typeof() to get the type or storage mode.
Class and type of a are 'integer', the class of t is 'table' and the type is 'integer'.
Note that a is a named integer, which is why two values are printed. If you use names(a), it will return only the value (as a character) of a.
If you use which.max(tabulate(matrix7)) it will return the value without the need to change it further.
which.max(tabulate(matrix7))
[1] 16
(Side note: since no seed is set in your code, the result differs between runs; you can set one using set.seed(x), where x is an integer.)
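The points from the answers above can be checked together in one short run. This is a sketch only: the seed value 42 is an arbitrary assumption chosen to make the run repeatable, and most_frequent is a name introduced here for illustration.

```r
set.seed(42)  # arbitrary seed, only so that repeated runs agree
matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)

t <- table(matrix7)   # counts per distinct value
a <- which.max(t)     # named integer: name = value, data = position in t

class(t)    # "table"
typeof(t)   # "integer"
class(a)    # "integer"

# The most frequent value is the NAME of a, not its data
most_frequent <- as.numeric(names(a))
most_frequent

# tabulate() counts 1, 2, 3, ... by position, so here the position is the value
which.max(tabulate(matrix7))
```

When several values share the highest count, both approaches report the smallest of them, because which.max returns the first maximum.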

How do I create a column using values of a second column that meet the conditions of a third in R?

I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset, and if the age at onset of MDD < the onset of OUD, it equals 1, and if the opposite is true, then it equals 2. I also have another column PhysDis that has values 0-100 (numeric in nature).
What I want to do is make a new column that includes the values of PhysDis, but only if MDDOnset == 1, and another if MDDOnset==2. I want to make these columns so that I can run a t-test on them and compare the two groups (those with MDD prior OUD, and those who had MDD after OUD with regards to which group has a greater physical disability score). I want any case where MDDOnset is not 1 to be NA.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I ran the t-test twice, once where MDDOnset == 1 and once where it equaled 2, the mean for y (Comorbidity$PhysDis) was the same. When I looked into the original csv file, it turned out that this mean was the mean of the entire column, and not just of the cases where MDDOnset had a value of one or two. If there is a different way to run the t-tests that would use the mean of PhysDis only when MDDOnset == 1, and another with the mean of PhysDis only when MDDOnset == 2, that does not require making new columns, then please tell me. Sorry if there are any similar questions or if my approach is way off; I'm new to R and programming in general. Thanks in advance.
Here's a smaller data frame where I tried to replicate the error where the new columns have switched lengths. The issue would be that the length of C would be 4, and the length of D would be 6 if I could replicate the error.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
    A  B  C  D
1   7 25 NA 25
2   9 34 NA 34
3   4 14 14 NA
4   2 76 76 NA
5   5 56 56 NA
6  10 34 NA 34
7   8 23 NA 23
8   6 12 12 NA
9   1 89 89 NA
10  3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
I would use the mutate function from the dplyr package, filling the non-matching rows with NA so that the new columns stay numeric:
library(dplyr)
Comorbidity <- mutate(Comorbidity, newColumn = ifelse(MDDOnset == 1, PhysDis, NA), newColumn2 = ifelse(MDDOnset == 2, PhysDis, NA))
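The question also asks for a way to run the test without new columns. The calls in the question pass the logical vector Comorbidity$MDDOnset == 1 as the first sample and the whole PhysDis column as the second, which is why the reported mean of y is always the mean of the entire column. A minimal sketch of two alternatives follows; the data frame is made up for illustration (the real Comorbidity data is not shown), and the names before, after, res_subset and res_formula are hypothetical.

```r
# Made-up stand-in for the real Comorbidity data
set.seed(1)
Comorbidity <- data.frame(
  MDDOnset = rep(c(1, 2), each = 10),
  PhysDis  = round(runif(20, min = 0, max = 100))
)

# Option 1: subset PhysDis by group, then compare the two samples
before <- Comorbidity$PhysDis[Comorbidity$MDDOnset == 1]
after  <- Comorbidity$PhysDis[Comorbidity$MDDOnset == 2]
res_subset <- t.test(before, after)

# Option 2: formula interface, "PhysDis grouped by MDDOnset"
res_formula <- t.test(PhysDis ~ MDDOnset, data = Comorbidity)

res_formula$estimate   # one mean per group
```

Both calls run the same Welch two-sample test. The formula interface assumes MDDOnset has exactly two distinct values; if the real column contains other codes, the rows would need to be filtered first.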

Merge with replacement based on multiple non-unique columns

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have a unique id column to help me match them. I have an x column and a y column which, combined, can make up a unique id.
My question is this: what is an elegant way of merging these two data sets, replacing the values in the original data frame with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
    x  y value
1   1 23   120
2   2 24   121
3   3 25   122
4   4 26   123
5   5 27   124
6   6 28   125
7   7 29   126
8   8 30   127
9   9 31   128
10 10 32   129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
  x  y value
1 1  2    50
2 2 24    51
3 3 17    52
4 4 23    53
5 8 30    54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
    x  y value
1   1 23   120
2   2 24    51
3   3 25   122
4   4 26   123
5   5 27   124
6   6 28   125
7   7 29   126
8   8 30    54
9   9 31   128
10 10 32   129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But the two columns are non unique.
One solution would be to treat them as strings or tuples and combine them to one column as a coordinate pair, and then use %in%.
But I was curious whether there is any solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
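The match function can be used to generate the row indices. It needs a nomatch argument to prevent NA values in the row index on the left-hand side of the [<-.data.frame assignment. I don't think it is as transparent as a merge followed by a replace, but I'm guessing it will be faster. Note that it relies on x and y each being unique within original, as they are in the example: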
original[ match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)] ,
"value"] <-
update[ which( match(update$x, original$x) == match(update$y, original$y)),
"value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
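The coordinate-pair idea mentioned in the question can also be combined with match. The sketch below is one possible way to do it, not something taken from the answers: it pastes x and y into a single key per row (the names key_original, key_update, idx and hit are introduced here for illustration) and assumes each (x, y) pair occurs at most once in original.

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update   <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# One key per row, built from the (x, y) pair
key_original <- paste(original$x, original$y, sep = "_")
key_update   <- paste(update$x, update$y, sep = "_")

# Row of original that each update row refers to (NA when there is none)
idx <- match(key_update, key_original)
hit <- !is.na(idx)

# Overwrite only the matched rows
original$value[idx[hit]] <- update$value[hit]
original$value
```

Because the key combines both columns, this version does not depend on x or y being unique on their own, only on the pair being unique.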

How do I repeat only a part of a vector?

I have a vector of: 0,24,12,12,12,96,12,12,12,12,12,12.
I want to repeat only a part of it from 96 to the last element (12). The first part (0, 24, 12, 12, 12) I want to keep constant.
Could you please help ?
The answer depends on whether the number 96 is always located at the 6th position inside your vector. If so, please refer to the first comment underneath your question. If the position is variable, however, you could identify the position of 96 inside your vector, and then repeat the part of the vector starting from there as often as you wish (2 times in the code below).
x <- c(0,24,12,12,12,96,12,12,12,12,12,12)
# Identify index of 96
id <- which(x == 96)
# Repeat part of vector starting from `id` 2 times
c(x[1:(id-1)], rep(x[id:length(x)], 2))
# Which results in
# [1] 0 24 12 12 12 96 12 12 12 12 12 12 96 12 12 12 12 12 12
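A slightly more defensive variant of the same idea is sketched below. It assumes that 96 occurs at least once and that the first occurrence is the one wanted; the names head_part, tail_part and result are introduced here for illustration. Using seq_len(id - 1) instead of 1:(id - 1) avoids picking up the first element by mistake when 96 is at position 1, because 1:0 counts down to c(1, 0) while seq_len(0) is empty.

```r
x <- c(0, 24, 12, 12, 12, 96, 12, 12, 12, 12, 12, 12)

# Position of the first 96 (assumed to exist)
id <- which(x == 96)[1]

head_part <- x[seq_len(id - 1)]   # part kept once: 0 24 12 12 12
tail_part <- x[id:length(x)]      # part to repeat: 96 followed by six 12s

# Repeat the tail 3 times after the constant head
result <- c(head_part, rep(tail_part, times = 3))
length(result)                    # 5 + 7 * 3 = 26
```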
