Refer to Index in Creating Data Frame R - r

I have variables v1, v2, etc. and I want to create a data frame from them.
I want to avoid doing:
df <-data.frame(v1,v2,...)
I would like to refer to the index in each of the variables and do something like:
for (i in 1:n){
df <-data.frame(v[i])
}
or give just the first and last variable:
df <-data.frame(v1 to vn)
I just can't figure out what the proper syntax is.

You can do:
as.data.frame(mget(paste0("v", 1:n)))
v1 <- 1:3
v2 <- 2:4
v3 <- 3:5
as.data.frame(mget(paste0("v", 1:3)))
#   v1 v2 v3
# 1  1  2  3
# 2  2  3  4
# 3  3  4  5
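If the variables are not numbered contiguously, one possible variation is to look the names up with ls() instead of building them with paste0(). This is only a sketch: it assumes every object in the current environment named "v" followed by digits belongs in the data frame, and that all of them have the same length.

```r
v1 <- 1:3
v2 <- 2:4
v3 <- 3:5

# Find all objects whose name is "v" followed by digits.
nms <- ls(pattern = "^v[0-9]+$")

# ls() sorts alphabetically (v10 before v2), so reorder by the numeric suffix.
nms <- nms[order(as.integer(sub("^v", "", nms)))]

# mget() returns a named list, which as.data.frame() turns into columns.
df <- as.data.frame(mget(nms))
df
```

The pattern is anchored with ^ and $ so that unrelated objects such as "value" or "v1_old" are not picked up.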

Related

Add new variable to specific position in dataframe without specifying a numbered position

In Stata, I can create a variable after or before another one. E.g. gen age=., after(sex)
I would like to do the same in R. Is it possible?
My dataset has 300 variables, so I don't want to count columns to find the numbered position, and the position might also change from time to time.
You could do:
library(tibble)
data <- data.frame(a = c(1,2,3), b = c(1,2,3), c = c(1,2,3))
add_column(data, d = "", .after = "b")
#   a b d c
# 1 1 1   1
# 2 2 2   2
# 3 3 3   3
Or another way could be:
data.frame(append(data, list(d = ""), after = match("b", names(data))))
First add the new column to the end of your data frame. Then find the index of the column after which the new column should appear, and reorder the columns to move it there:
df$new_col <- ...
index <- match("col_before", names(df))
df <- df[, c(names(df)[c(1:index)], "new_col", names(df)[c((index+1):(ncol(df)-1))])]
Sample:
df <- data.frame(v1=c(1:3), v2=c(4:6), v3=c(7:9))
df$new_col <- c(7,7,7)
index <- match("v2", names(df))
df <- df[, c(names(df)[c(1:index)], "new_col", names(df)[c((index+1):(ncol(df)-1))])]
df
  v1 v2 new_col v3
1  1  4       7  7
2  2  5       7  8
3  3  6       7  9
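One thing to watch with the reordering above: if the reference column is the last original column, (index+1):(ncol(df)-1) counts downwards and the column selection goes wrong. A minimal sketch of a wrapper that avoids this by building the column order with append() follows. The name insert_after and its arguments are made up for illustration and do not come from any package.

```r
# Hypothetical helper: add `values` as column `new_name` right after column `after`.
insert_after <- function(df, new_name, values, after) {
  idx <- match(after, names(df))
  if (is.na(idx)) stop("column not found: ", after)
  df[[new_name]] <- values          # new column is appended at the end
  n <- ncol(df)
  # Original positions 1..(n-1), with the new column's position n placed after idx.
  ord <- append(seq_len(n - 1), n, after = idx)
  df[, ord, drop = FALSE]
}

df <- data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9)
names(insert_after(df, "new_col", 7, after = "v2"))
names(insert_after(df, "new_col", 7, after = "v3"))  # reference column is the last one
```

Because append() accepts an `after` equal to the vector length, the same code handles the last-column case without a special branch.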

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed to:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking differences. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
#   id v1 v2 v3
# 1  1  4  1  4
# 2  2  1  0  3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
  df2[,i] <- df[,i] - df[,i-1]
}
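With around 150 columns, a vectorised alternative to both apply and the loop is to subtract the matrix of "left neighbour" columns in one step. This is a sketch, assuming the cumulative columns are numeric, stored in left-to-right order, and that there are at least two of them. The name vals is just an illustration.

```r
x <- read.table(header = TRUE, text = "
id v1 v2 v3
1 4 5 9
2 1 1 4")

vals <- c("v1", "v2", "v3")   # columns holding the cumulative sums
m <- as.matrix(x[, vals])

# Every column except the first minus every column except the last,
# i.e. each column minus its left neighbour. The first column is kept as is.
m[, -1] <- m[, -1] - m[, -ncol(m)]

x[, vals] <- m
x
```

For the real data, vals could be built with something like setdiff(names(x), "id") rather than typed out.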

Sum Values of Every Column in Data Frame with Conditional For Loop

So I want to go through a data set and sum the values from each column based on the condition of my first column. The data and my code so far looks like this:
x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20
for(i in colnames(data)){
  if(data$x>2){
    x1 <-sum(data[[i]])
  }
  else{
    x2 <-sum(data[[i]])
  }
}
My assumption was that the for loop would call each column by name from the data and then sum the values in each column based on whether they matched the condition of column x.
I want to sum half the values from each column and assign them to a value x1 and do the same for the remainder, assigning it to x2. I keep getting an error saying the following:
the condition has length > 1 and only the first element will be used
What am I doing wrong and is there a better way to go about this? Ideally I want a table that looks like this:
v1 v2 v3
x1 6 7 35
x2 4 3 15
Here's a dplyr solution. First, I define the data frame.
df <- read.table(text = "x v1 v2 v3
1 0 1 5
2 4 2 10
3 5 3 15
4 1 4 20", header = TRUE)
# x v1 v2 v3
# 1 1 0 1 5
# 2 2 4 2 10
# 3 3 5 3 15
# 4 4 1 4 20
Then, I create a label (x_check) to indicate which group each row belongs to based on your criterion (x > 2), group by this label, and summarise each column with a v in its name using sum.
# Load library
library(dplyr)
df %>%
  mutate(x_check = ifelse(x>2, "x1", "x2")) %>%
  group_by(x_check) %>%
  # funs() is deprecated since dplyr 0.8.0; the function can be passed directly
  summarise_at(vars(contains("v")), sum)
# # A tibble: 2 x 4
# x_check v1 v2 v3
# <chr> <int> <int> <int>
# 1 x1 6 7 35
# 2 x2 4 3 15
Not sure if I understood your intention correctly, but here is how you would reproduce your results with base R:
df <- data.frame(
  x = c(1:4),
  v1 = c(0, 4, 5, 1),
  v2 = 1:4,
  v3 = (1:4)*5
)
x1 <- colSums(df[df$x > 2, 2:4, drop = FALSE])
x2 <- colSums(df[df$x <= 2, 2:4, drop = FALSE])
Where
df[df$x > 2, 2:4, drop = FALSE] creates a subset of df with the rows that satisfy df$x > 2 and columns 2:4 (the second, third and fourth columns). drop = FALSE keeps the result a data frame in cases where R would otherwise simplify it, e.g. when only one column is selected.
colSums does a by-column sum on the subsetted data.frame
If your x column was really a condition (e.g. a logical vector) you could just do
x1 <- colSums(df[df$x, 2:4, drop = FALSE])
x2 <- colSums(df[!df$x, 2:4, drop = FALSE])
Note that no loop is needed to get the results. In R you should use vectorized functions as much as possible.
More generally, you could do such aggregation with aggregate:
aggregate(df[, 2:4], by = list(condition = df$x <= 2), FUN = sum)
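As for the error in the question: if() expects a single TRUE or FALSE, but data$x > 2 is a logical vector with one element per row, so the test itself fails (or, in older R versions, only the first element is used, with a warning). One way to get exactly the table asked for, with x1 and x2 as row labels, might be base R's rowsum(). This is a sketch and the grp variable is only illustrative.

```r
df <- data.frame(
  x  = 1:4,
  v1 = c(0, 4, 5, 1),
  v2 = 1:4,
  v3 = (1:4) * 5
)

# The comparison is vectorised: one TRUE/FALSE per row, not a single value.
length(df$x > 2)

# Label each row with its group, then sum every v column within each group.
grp <- ifelse(df$x > 2, "x1", "x2")
res <- rowsum(df[, c("v1", "v2", "v3")], group = grp)
res
```

rowsum() puts the group labels in the row names, which matches the layout of the desired output.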

Add values to a vector to make a consecutive vector in R

I have several vectors that look like this:
v1 <- c(1,2,4)
v2 <- c(3,5,8)
v3 <- c(4)
This is just a small sample of them. I'm trying to figure out a way to add values to each of them to make them all consecutive vectors. So that at the end, they look like this:
v1 <- c(1,2,3,4)
v2 <- c(1,2,3,4,5,6,7,8)
v3 <- c(1,2,3,4)
So "3" is added to the first vector, "1","2","4","6","7" is added to the second and so forth. I have several hundred vectors that look like this so I'm trying to figure out a solution that would scale/be automated.
You can use seq and max
seq(max(v1))
For multiple vectors, we can loop
lapply(mget(paste0('v',1:3)), function(x) seq(max(x)))
#$v1
#[1] 1 2 3 4
#$v2
#[1] 1 2 3 4 5 6 7 8
#$v3
#[1] 1 2 3 4
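With several hundred vectors it may be easier to keep them in a named list from the start rather than fetching separate objects with mget. A minimal sketch, assuming all values are positive whole numbers and every sequence should start at 1 (the names vecs and filled are illustrative):

```r
# The vectors from the question, stored in one named list.
vecs <- list(
  v1 = c(1, 2, 4),
  v2 = c(3, 5, 8),
  v3 = c(4)
)

# seq_len(n) gives 1, 2, ..., n, so each vector is filled up to its maximum.
filled <- lapply(vecs, function(x) seq_len(max(x)))
filled
```

seq_len() is used here instead of seq() only because it is explicit about always counting up from 1.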

Random subsets in function of one column in r

I want to randomly extract n rows from a data frame for each value of one column.
So with this example :
# Reproducible example
df <- as.data.frame(matrix(0,2e+6,2))
df$V1 <- runif(nrow(df),0,1)
df$V2 <- sample(c(1:10),nrow(df), replace=TRUE)
df$V3 <- sample(c("A","B","C"),nrow(df), replace=TRUE)
I want to extract, for example, n = 10 rows for each value of V2.
# Example of what I need with one value of V2
df1 <- df[which(df$V2==1),]
str(df1)
df1[sample(1:nrow(df1),10),]
I do not want to use a for loop, so I tried this line with tapply:
df_objective <- tapply(df$V1, df$V2, function(x) df[sample(1:nrow(df),10),"V2"])
which is close to what I want but I lost the third column of the data frame.
I tried this to get complete subsets:
df_objective <- by(cbind(df$V1,df$V3), df$V2, function(x) df[sample(1:nrow(df),10),"V2"])
but it does not help.
How can I keep all the columns in the subsets?
It sounds like you're just looking for something like sample_n from "dplyr":
library(dplyr)
df %>% group_by(V2) %>% sample_n(10)
# Source: local data frame [100 x 3]
# Groups: V2
#
# V1 V2 V3
# 1 0.51099392 1 B
# 2 0.87098866 1 A
# 3 0.13647752 1 B
# 4 0.15348834 1 B
# 5 0.94096127 1 B
# 6 0.05673849 1 A
# 7 0.69960842 1 C
# 8 0.02246671 1 C
# 9 0.88903430 1 B
# 10 0.52128253 1 A
# .. ... .. ..
Alternatively, there's stratified from my "splitstackshape" package.
library(splitstackshape)
stratified(df, "V2", 10)
You can try
library(data.table)
setDT(df)[, .SD[sample(.N, 10)] , V2]
Or a faster option as suggested by @Frank
setDT(df)[df[,sample(.I,10),V2]$V1]
You want to sample from the rows, so that should be the first arg to tapply, not V1:
myrows <- unlist(tapply(1:nrow(df),df$V2,sample,size=10))
df[myrows,]
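The same idea can be written with split(), which makes the "sample row numbers within each group" step explicit. This is a sketch on a smaller data frame than in the question. The min() guard is an addition for groups with fewer than 10 rows, and idx and out are illustrative names.

```r
set.seed(1)  # only to make the example reproducible
df <- data.frame(
  V1 = runif(1000),
  V2 = sample(1:10, 1000, replace = TRUE),
  V3 = sample(c("A", "B", "C"), 1000, replace = TRUE)
)

# Split the row numbers by V2, then pick up to 10 row numbers from each group.
idx <- unlist(lapply(
  split(seq_len(nrow(df)), df$V2),
  function(i) i[sample.int(length(i), min(10, length(i)))]
))

# Index the full data frame, so all columns are kept.
out <- df[idx, ]
table(out$V2)
```

Indexing into i with sample.int() avoids the surprise that sample(i, ...) treats a single number i as the range 1:i.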
