I have a vector of length 1000. It contains (numeric) survey answers of 100 participants, thus 10 answers per participant. I would like to drop the first three values for every participant to create a new vector of length 700 (including only the answers to questions 4-10).
I only know how to extract every n-th value of the vector, but cannot figure out how to solve the above problem.
vector <- seq(1,1000,1)
Expected output:
4 5 6 7 8 9 10 14 15 16 17 18 19 20 24 ...
Using a matrix to first structure and then flatten is one method. Another somewhat similar method is to use what I am calling a "logical pattern index":
head( # just showing the first couple of "segments"
vector[ c( rep(FALSE, 3), rep(TRUE, 10-3) ) ],
15)
[1] 4 5 6 7 8 9 10 14 15 16 17 18 19 20 24
This method can also be used inside the two-argument version of [ to select rows or columns using a logical pattern index. This works because of R's recycling of logical indices.
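As a minimal sketch of the two-argument case (the matrix m below is made up for illustration and is not part of the question), the same kind of logical pattern can be applied along rows or columns:

```r
# Hypothetical 10 x 4 matrix: one row per question, one column per participant
m <- matrix(1:40, nrow = 10)

# Logical pattern: drop the first 3 positions, keep the remaining 7
keep <- c(rep(FALSE, 3), rep(TRUE, 7))

# Here the pattern length (10) matches the number of rows exactly
rows_kept <- m[keep, ]

# A shorter pattern is recycled: c(FALSE, TRUE) keeps columns 2 and 4
cols_kept <- m[, c(FALSE, TRUE)]

dim(rows_kept)  # 7 4
dim(cols_kept)  # 10 2
```

Recycling is convenient here, but it is safest when the pattern length divides the dimension evenly, as it does in the original question (10 divides 1000).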
Thanks for providing example data, which makes this thread reproducible. Here is one solution:
c(matrix(vector, 10)[4:10, ])
We first convert the vector to a matrix with 10 rows, so that each column corresponds to a participant. Then we use row subsetting to remove the first three rows. Finally, the matrix is flattened to a vector again.
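A quick way to convince yourself that the matrix approach and the logical pattern index give the same result is to compare them directly. This is only a check on the example data from the question:

```r
vector <- seq(1, 1000, 1)

# Approach 1: reshape to 10 rows (one column per participant),
# drop rows 1-3, then flatten column by column
via_matrix <- c(matrix(vector, 10)[4:10, ])

# Approach 2: recycled logical pattern (3 x FALSE, then 7 x TRUE)
via_pattern <- vector[c(rep(FALSE, 3), rep(TRUE, 7))]

length(via_matrix)                  # 700
identical(via_matrix, via_pattern)  # TRUE
head(via_matrix, 8)                 # 4 5 6 7 8 9 10 14
```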
Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 10 20 30
I have tried to use as.data.frame(), which does not change the results for two or more columns. It does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
df[, 1]
1 10
2 20
3 30
I don't understand why it behaves like this. In my mind it should not make a difference whether I extract one, two, or ten columns. It should either always return a vector (or matrix) or always return a data.frame (with the correct names). What am I missing? Thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
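To see the effect of drop on the example data frame from the question, the classes of the results can be compared. This is a small illustrative check, not part of the original answer:

```r
df <- data.frame(A = c(10, 20, 30), B = c(11, 22, 33), C = c(111, 222, 333))

class(df[, 1])                # "numeric": the single column is dropped to a vector
class(df[, 1, drop = FALSE])  # "data.frame": the dimensions are kept
names(df[, 1, drop = FALSE])  # "A": the column name survives
class(df[, 1:2])              # "data.frame": two columns cannot be dropped to a vector
```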
Omit the ,:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
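Because a data frame is a list of columns, the usual list rules apply: single-bracket indexing returns a shorter list (here, a one-column data frame), while double-bracket indexing extracts the element itself. A short sketch on the question's data:

```r
df <- data.frame(A = c(10, 20, 30), B = c(11, 22, 33), C = c(111, 222, 333))

is.list(df)   # TRUE
length(df)    # 3: one list element per column

one_col_df  <- df[1]    # list-style [ : still a data frame, name "A" kept
one_col_vec <- df[[1]]  # [[ : the column itself, a numeric vector

class(one_col_df)   # "data.frame"
class(one_col_vec)  # "numeric"
```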
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments, you can also use dplyr::select, and you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
select(A)
# by index
df %>%
select(1)
My question is related to R.
I have code snippets for 5 answer choices. When I run them, every choice except one gives an error. The one that runs also does not match what the question asks for.
My question is
A B C D E
1 7 4 23 68 15
2 12 53 14 10 20
3 39 88 98 50 84
4 18 38 33 47 72
5 31 6 51 38 27
6 20 15 68 99 50
This data frame is given. To create it, I wrote the following code block.
A = c(7,12,39,18,31,20)
B = c(4,53,88,38,6,15)
C = c(23,14,98,33,51,68)
D = c(68,10,50,47,38,99)
E = c(15,20,84,72,27,50)
df_x = data.frame(A,B,C,D,E)
Question: Which of the following R code will subset data frame df_x, returning the final three rows?
My answer choices are
df_x[nrow(df_x)-2:nrow(df_x)]
df_x[(nrow(df_x)-2):nrow(df_x)]
df_x[nrow(df-x)-2:,]
df_x[-3:]
df_x[(nrow(df_x)-2):nrow(df_x)
Among them, only the 1st choice, df_x[nrow(df_x)-2:nrow(df_x)], gives some output.
Output:
D C B A
1 68 23 4 7
2 10 14 53 12
3 50 98 88 39
4 47 33 38 18
5 38 51 6 31
6 99 68 15 20
I think this is not the correct one. All other choices give an error. Can anyone tell me which one is the correct choice? Or what is the actual code that answers the question? I am new to R, so it is hard for me to find the correct one.
df_x[(nrow(df_x)-2):nrow(df_x),]
Keep in mind, the convention is df[rows, columns]. To select rows you need to specify both arguments, which is why I put a comma after the row argument in the solution.
Cheers,
Joe
The answers in those choices will produce errors because they are not creating the indexes properly.
In R, when you are subsetting a data frame, you need to give the row numbers and the column numbers.
For example, df[row,col] will give you the data that is in the given row and the given column. df[row,] will select all columns for the given row number.
If you don't put a comma (,) in the index, you are only selecting the columns. For example, df[1:2] is going to select the first and second columns.
If you want to select multiple rows or multiple columns, you can put the numbers in as well, e.g. df[1:3,3:9].
When you use -, R removes the given row or column. So for example, df[-1,] removes the first row. df[,-3] removes the third column. df[-1:-5,] removes the first five rows.
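The indexing rules above can be tried out on the data frame from the question. The sketch below only inspects the dimensions of each result; the column range 3:5 is used instead of 3:9 because df_x has just five columns:

```r
df_x <- data.frame(
  A = c(7, 12, 39, 18, 31, 20),
  B = c(4, 53, 88, 38, 6, 15),
  C = c(23, 14, 98, 33, 51, 68),
  D = c(68, 10, 50, 47, 38, 99),
  E = c(15, 20, 84, 72, 27, 50)
)

dim(df_x[2, ])       # 1 5: one row, all columns
dim(df_x[1:2])       # 6 2: no comma, so columns 1 and 2
dim(df_x[1:3, 3:5])  # 3 3: rows 1-3, columns 3-5
dim(df_x[-1, ])      # 5 5: first row removed
dim(df_x[, -3])      # 6 4: third column removed
dim(df_x[-1:-5, ])   # 1 5: first five rows removed
```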
Those answers all have errors in them because they don't have commas in the right places. If you want to select up to the last row or column in R, you need to give the last row or column number. You get this number by using nrow(df) or ncol(df). Leaving the end of a range open, as in -3:, is Python slicing syntax and is not valid R.
The closest answer here is: df_x[(nrow(df_x)-2):nrow(df_x)] but you need to add a comma: df_x[(nrow(df_x)-2):nrow(df_x),]
The problem you are expected to recognize (but have not) is operator precedence. The colon operator (for sequencing) has a higher precedence than the binary minus operator, so the expression nrow(df_x)-2:nrow(df_x) is evaluated as nrow(df_x) - (2:nrow(df_x)): the single value nrow(df_x) is recycled, and the vector 2:nrow(df_x) is subtracted from it element by element. Option number 2, which isolates nrow(df_x)-2 from the colon operator with parentheses, gives you the correct index. Adding parentheses to make terms obvious is good programming practice. See:
?Syntax
The other problem is that there is a missing comma after those expressions ... I think your course text should have given option 2 as
df_x[(nrow(df_x)-2):nrow(df_x),]
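The precedence issue is easy to see with plain numbers. The sketch below rebuilds df_x from the question and also shows why the first answer choice returned the columns D, C, B, A: the index evaluates to 4 3 2 1 0, it selects columns because there is no comma, and the 0 is silently ignored.

```r
df_x <- data.frame(
  A = c(7, 12, 39, 18, 31, 20),
  B = c(4, 53, 88, 38, 6, 15),
  C = c(23, 14, 98, 33, 51, 68),
  D = c(68, 10, 50, 47, 38, 99),
  E = c(15, 20, 84, 72, 27, 50)
)
n <- nrow(df_x)  # 6

n - 2:n    # 4 3 2 1 0: the colon binds first, then the subtraction
(n - 2):n  # 4 5 6: the parentheses force the subtraction first

names(df_x[n - 2:n])  # "D" "C" "B" "A": column selection, index 0 dropped

last3 <- df_x[(n - 2):n, ]  # the comma makes this a row selection
rownames(last3)             # "4" "5" "6"
```

If the goal is simply the last three rows, tail(df_x, 3) is a common alternative that avoids the index arithmetic altogether.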
Suppose a data frame df has a column speed. What is the difference between accessing the column like so:
df["speed"]
or like so:
df$speed
The following calculates the mean value correctly:
lapply(df["speed"], mean)
But this prints all values under the column speed:
lapply(df$speed, mean)
There are two elements to the question in the OP. The first element was addressed in the comments: df["speed"] is an object of type data.frame() whereas df$speed is a numeric vector. We can see this via the str() function.
We'll illustrate this with Ezekiel's 1930 analysis of speed and stopping distance, the cars data set from the datasets package.
> library(datasets)
> data(cars)
>
> str(cars["speed"])
'data.frame': 50 obs. of 1 variable:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
> str(cars$speed)
num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
>
The second element that was not addressed in the comments is that lapply() behaves differently when passed a vector versus a list().
With a vector, lapply() processes each element in the vector independently, producing unexpected results for a function such as mean().
> unlist(lapply(cars$speed,mean))
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
What happened?
Since each element of cars$speed is processed by mean() independently, lapply() returns a list of 50 means of 1 number each: the original elements in the cars$speed vector.
Processing a list with lapply()
With a list, each element of the list is processed independently. We can calculate how many items will be processed by lapply() with the length() function.
> length(cars["speed"])
[1] 1
>
Since cars["speed"] is a data frame, and a data frame is a list() whose elements are its columns, it contains exactly one element: the numeric vector speed. The length() function therefore returns the value 1. When processed by lapply(), a single mean is calculated for that column, not one per row of the speed column.
> lapply(cars["speed"],mean)
$speed
[1] 15.4
>
If we pass the entire cars data frame as the input object for lapply(), we obtain one mean per column in the data frame, since both variables in the data frame are numeric.
> lapply(cars,mean)
$speed
[1] 15.4
$dist
[1] 42.98
>
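One way to make the difference visible is to look at what as.list() produces for each input, since that roughly corresponds to the elements lapply() visits. This is an illustrative check on the cars data, not a description of lapply() internals:

```r
library(datasets)

# A numeric vector becomes a list with one element per value
length(as.list(cars$speed))     # 50

# A one-column data frame is already a list with one element (the column)
length(as.list(cars["speed"]))  # 1

# For a single column, calling mean() directly avoids lapply() entirely
mean(cars$speed)                # 15.4
```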
A theoretical perspective
The differing behaviors of lapply() are explained by the fact that R is an object oriented language. In fact, John Chambers, creator of the S language on which R is based, once said:
In R, two slogans are helpful.
-- Everything that exists is an object, and
-- Everything that happens is a function call.
John Chambers, quoted in Advanced R, p. 79.
The fact that lapply() works differently on a data frame than a vector is an illustration of the object oriented feature of polymorphism where the same behavior is implemented in different ways for different types of objects.
While this looks like a beginner's question, I think it's worth answering since many beginners could have a similar question, and a guide to the corresponding documentation is helpful IMHO.
No up-votes please - I am just collecting the comment fragments from the question that contribute to the answer - feel free to edit this answer...
A data.frame is a list of vectors with the same length (number of elements). Please read the help in the R console (by typing ?data.frame)
The $ operator is implemented by returning one column as vector (?"$.data.frame")
lapply applies a function to each element of a list (see ?lapply). If the first param X is an atomic vector (integer, double...) with multiple elements, each element of the vector is converted ("coerced") into one separate list element (same as as.list(1:26))
Examples:
x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)
b.vector <- x$b
b.data.frame <- x["b"]
class(b.vector) # integer
class(b.data.frame) # data.frame
lapply(b.vector, mean)
# returns a result list with 26 list elements, the same as `lapply(1:26, mean)`
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
# ... up to list element 26
lapply(b.data.frame, mean)
# returns a list with one element per column of the data frame;
# here the single column b yields a single mean
# $b
# [1] 13.5
So IMHO your original question can be reduced to: Why is lapply behaving differently if the first parameter is an atomic vector instead of a list?
I have noticed a curious quirk in R. Let's say I have a vector and I want to use logical indices where I intend to just refer to the first two elements.
vec <- rep(2, 10)
vec[c(TRUE,FALSE)] <- 13
My output now shows 13 assigned in every other value
vec
[1] 13 2 13 2 13 2 13 2 13 2
Now, I know I could just use the numeric indices (for example wrapping the logical values with a which call) but I am curious about this.
Why does R repeat logical values when indexing a vector?
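The sketch below does not explain the design decision; it only reproduces the behaviour and shows two ways to avoid it. A logical index that is shorter than the vector is recycled to the vector's full length, so c(TRUE, FALSE) acts like the same pattern repeated five times. The names idx, recycled, vec2 and vec3 are introduced here for illustration.

```r
vec <- rep(2, 10)
idx <- c(TRUE, FALSE)

# The short logical index behaves like its recycled, full-length version
recycled <- rep(idx, length.out = length(vec))
identical(vec[idx], vec[recycled])  # TRUE

# Option 1: convert to positions first, so nothing is recycled
vec2 <- rep(2, 10)
vec2[which(idx)] <- 13
vec2  # 13 2 2 2 2 2 2 2 2 2

# Option 2: pad the logical index with FALSE to the full length
vec3 <- rep(2, 10)
vec3[c(idx, rep(FALSE, length(vec3) - length(idx)))] <- 13
identical(vec2, vec3)  # TRUE
```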
Assume I have three matrices...
A=matrix(c("a",1,2),nrow=1,ncol=3)
B=matrix(c("b","c",3,4,5,6),nrow=2,ncol=3)
C=matrix(c("d","e","f",7,8,9,10,11,12),nrow=3,ncol=3)
I want to find all possible combinations of column 1 (characters or names) while summing up columns 2 and 3. The result would be a single matrix with length equal to the total number of possible combinations, in this case 6. The result would look like the following matrix...
Result <- matrix(c("abd","abe","abf","acd","ace","acf",11,12,13,12,13,14,17,18,19,18,19,20),nrow=6,ncol=3)
I do not know how to add a table to this question, otherwise I would show it more descriptively. Thank you in advance.
You are mixing character and numeric values in a matrix and this will coerce all elements to character. Much better to define your matrix as numeric and keep the character values as the row names:
A <- matrix(c(1,2),nrow=1,dimnames=list("a",NULL))
B <- matrix(c(3,4,5,6),nrow=2,dimnames=list(c("b","c"),NULL))
C <- matrix(c(7,8,9,10,11,12),nrow=3,dimnames=list(c("d","e","f"),NULL))
#put all the matrices in a list
mlist<-list(A,B,C)
Then we use some Map, Reduce and lapply magic:
res <- Reduce("+",Map(function(x,y) y[x,],
expand.grid(lapply(mlist,function(x) seq_len(nrow(x)))),
mlist))
Finally, we build the rownames
rownames(res)<-do.call(paste0,expand.grid(lapply(mlist,rownames)))
# [,1] [,2]
#abd 11 17
#acd 12 18
#abe 12 18
#ace 13 19
#abf 13 19
#acf 14 20
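For readers who find the Map/Reduce version dense, the same result can be cross-checked with a more explicit (and slower) formulation that builds the combinations of row names first and then sums the matching rows. This is only an alternative sketch; the names combos and sums are made up for illustration.

```r
A <- matrix(c(1, 2), nrow = 1, dimnames = list("a", NULL))
B <- matrix(c(3, 4, 5, 6), nrow = 2, dimnames = list(c("b", "c"), NULL))
C <- matrix(c(7, 8, 9, 10, 11, 12), nrow = 3, dimnames = list(c("d", "e", "f"), NULL))

# All combinations of row names; the first column varies fastest
combos <- expand.grid(a = rownames(A), b = rownames(B), c = rownames(C),
                      stringsAsFactors = FALSE)

# For each combination, add up the corresponding rows of A, B and C
sums <- t(sapply(seq_len(nrow(combos)), function(i) {
  A[combos$a[i], ] + B[combos$b[i], ] + C[combos$c[i], ]
}))
rownames(sums) <- paste0(combos$a, combos$b, combos$c)

sums
#     [,1] [,2]
# abd   11   17
# acd   12   18
# abe   12   18
# ace   13   19
# abf   13   19
# acf   14   20
```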