Vector to dataframe with variable row length [duplicate] - r

Given a vector, I want to convert it to a dataframe using a 'key' value which is randomly distributed throughout the vector at the start of what is to be a row. In this case, "z" would be the first value in each column.
vd <- c("z","a","b","c","z","a","b","c","z","a","b","c","d")
The resultant data should look like:
# using magrittr for %>%; t() transposes so that each x becomes a row
data.frame(x1 = c("z","a","b","c", NA), x2 = c("z","a","b","c", NA), x3 = c("z","a","b","c","d")) %>%
  t()
One solution would be to find the largest distance between 'keys' in the vector and then pad the end of every 'section' that is shorter than the longest one with blank values, so that matrix() could be used.
What would be the best way to do this?

plyr::ldply(split(vd, cumsum(vd == "z")), rbind)[-1]
(copied from here)
result:
1 2 3 4 5
1 z a b c <NA>
2 z a b c <NA>
3 z a b c d

We can use cumsum to identify the groups and then split on them. Then we pad each vector with NA up to the length of the longest one and format the result as a data.frame.
x <- split(vd,cumsum("z"==vd))
maxl <- max(lengths(x))
as.data.frame(lapply(x,function(y) c(y,rep(NA,maxl-length(y)))))
# X1 X2 X3
# 1 z z z
# 2 a a a
# 3 b b b
# 4 c c c
# 5 <NA> <NA> d
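The padding idea from the question can also be sketched in base R so that each section ends up as a row rather than a column. This is only a sketch: it assumes the vector starts with the key value (otherwise the leading elements form an extra group), and the names sections, width, padded and result are made up for illustration.

```r
vd <- c("z", "a", "b", "c", "z", "a", "b", "c", "z", "a", "b", "c", "d")

# Start a new section at every key value (assumes vd[1] is the key)
sections <- split(vd, cumsum(vd == "z"))

# Pad every section with NA up to the length of the longest one
width <- max(lengths(sections))
padded <- lapply(sections, `length<-`, width)

# Stack the padded sections as rows
result <- as.data.frame(do.call(rbind, padded))
result
```

Using `length<-` avoids computing the number of missing values by hand, and do.call(rbind, ...) gives one row per section without a separate transpose step.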

Related

Ordering data.frame by multiple variables in mixed direction [duplicate]

For this sample data.frame,
df <- data.frame(var1=c("b","a","b","a","a","b"),
                 var2=c("l","l","k","k","l","k"),
                 var3=c("t","t","x","t","x","x"),
                 var4=c(5,3,3,5,5,3),
                 stringsAsFactors=FALSE)
Unsorted
var1 var2 var3 var4
1 b l t 5
2 a l t 3
3 b k x 3
4 a k t 5
5 a l x 5
6 b k x 3
I would like to sort on three columns 'var2', 'var3' and 'var4' in this order simultaneously. One column ascending and another two descending. Column names to sort are stored in variables.
sort_asc <- "var2"
sort_desc <- c("var3","var4")
What's the best way to do this in base R?
Updated details
This is the output if sorted ascending by 'var2' first (step 1) and then descending by 'var3' and 'var4' (as step 2).
var1 var2 var3 var4
a l x 5
b k x 3
b k x 3
a k t 5
b l t 5
a l t 3
But what I am looking for is doing all three sorts at the same time to get this:
var1 var2 var3 var4
b k x 3
b k x 3
a k t 5
a l x 5
b l t 5
a l t 3
'var2' is ascending (k, l); within k and within l, 'var3' is descending, and similarly 'var4' is descending.
To clarify, how this question is different from other data.frame ordering questions...
ordering on multiple columns
column names to order on are stored in variables
different ordering directions (asc,desc)
ordering is not step-wise (one sort after another) but rather simultaneous (all selected columns at same time)
using base R, not dplyr
Step-wise ordering (sort ascending first and then descending).
dplyr solution:
library(dplyr)
df %>%
  arrange_at(sort_asc) %>%
  arrange_at(sort_desc, desc)
var1 var2 var3 var4
1 a l x 5
2 b k x 3
3 b k x 3
4 a k t 5
5 b l t 5
6 a l t 3
base R solution:
With base R, if there are multiple columns (in general), use order within do.call. Here, we create the index for ascending order first, then sort the result in descending order by the second set of columns ('sort_desc').
i1 <- do.call(order, df[sort_asc])
df1 <- df[i1,]
i2 <- do.call(order, c(df1[sort_desc], list(decreasing = TRUE)))
df1[i2,]
var1 var2 var3 var4
5 a l x 5
3 b k x 3
6 b k x 3
4 a k t 5
1 b l t 5
2 a l t 3
Simultaneous/Sequential ordering (all ordering variables are used in one ordering step):
dplyr solution:
df %>%
  arrange_(.dots = c(sort_asc, paste0("desc(", sort_desc, ")")))
# var1 var2 var3 var4
#1 b k x 3
#2 b k x 3
#3 a k t 5
#4 a l x 5
#5 b l t 5
#6 a l t 3
base R solution:
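With base R, to get the same output as with arrange_, pass all the sort keys to a single order call and negate the rank (xtfrm) of each column that should be descending:
df[do.call(order, c(as.list(df[sort_asc]),
                    lapply(df[sort_desc], function(x) -xtfrm(x)))),]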
# var1 var2 var3 var4
#3 b k x 3
#6 b k x 3
#4 a k t 5
#5 a l x 5
#1 b l t 5
#2 a l t 3
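If this ordering is needed in several places, the same idea could be wrapped in a small helper. This is a sketch only; order_mixed and its arguments are hypothetical names and not part of base R. It assumes at least one sort column is given.

```r
# Hypothetical helper: sort `data` by the `asc` columns ascending and the
# `desc` columns descending, all in one call to order()
order_mixed <- function(data, asc = character(), desc = character()) {
  keys <- c(
    lapply(data[asc], xtfrm),                      # ascending keys as ranks
    lapply(data[desc], function(col) -xtfrm(col))  # negated ranks sort descending
  )
  # unname() so that column names are not mistaken for arguments of order()
  data[do.call(order, unname(keys)), , drop = FALSE]
}

df <- data.frame(var1 = c("b", "a", "b", "a", "a", "b"),
                 var2 = c("l", "l", "k", "k", "l", "k"),
                 var3 = c("t", "t", "x", "t", "x", "x"),
                 var4 = c(5, 3, 3, 5, 5, 3),
                 stringsAsFactors = FALSE)

sorted <- order_mixed(df, asc = "var2", desc = c("var3", "var4"))
sorted
```

xtfrm turns character columns into numeric ranks, which is why the minus sign works for them as well as for numeric columns.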

match multiple columns and create/update selected multiple columns [duplicate]

I would like to update the dataframe d_sub with two new columns, x and y (excluding column xy), by matching on the common columns (treatment, replicate) in the parent dataframe d.
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, xy=x*y, treatment=treatment, replicate=replicate)
d_sub <- d[sample(nrow(d),6),4:5]
d_sub
# treatment replicate
# 32 b 2
# 11 h 7
# 9 h 3
# 20 e 3
# 10 b 5
# 7 d 3
Unlike the normal merge or other methods mentioned here, I only need to extract a few columns, as shown in the expected output below:
# treatment replicate x y
# 32 b 2 2 8.998847
# 11 h 7 1 5.082928
# 9 h 3 2 7.050445
# 20 e 3 10 10.145350
# 10 b 5 10 7.941056
# 7 d 3 7 6.814287
Note the exclusion of the xy column in the output here! In my original problem, there are thousands of columns that I do not need in the output and only a very few that I do. I am especially looking for methods other than merge, to find out whether I can achieve this in a memory-efficient way.
I guess it has been asked here before, but what you are looking for is:
merge(d_sub, d, by=c("treatment", "replicate"))
or:
d_sub <- merge(d_sub, d, by=c("treatment", "replicate"))
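Note that merging against the whole of d also brings xy along. A rough sketch of two ways to pull in only the wanted columns follows, using a small made-up d and d_sub rather than the data from the question. It assumes each (treatment, replicate) pair occurs at most once in d, and that "|" never appears in the key values; keys, wanted and idx are illustrative names.

```r
# Small made-up data, not the data from the question
d <- data.frame(treatment = c("a", "a", "b", "b"),
                replicate = c(1, 2, 1, 2),
                x = 1:4,
                y = c(0.5, 1.5, 2.5, 3.5),
                xy = c(0.5, 3, 7.5, 14))
d_sub <- data.frame(treatment = c("b", "a"), replicate = c(2, 1))

keys <- c("treatment", "replicate")
wanted <- c("x", "y")

# Option 1: merge, but only against the key columns plus the wanted columns
merged <- merge(d_sub, d[c(keys, wanted)], by = keys)

# Option 2: build a combined key and look rows up with match(); this keeps
# the row order of d_sub and copies only the wanted columns
# (assumes the keys are unique in d, since match() returns the first hit)
idx <- match(do.call(paste, c(d_sub[keys], sep = "|")),
             do.call(paste, c(d[keys], sep = "|")))
d_sub[wanted] <- d[idx, wanted]
d_sub
```

Option 1 sorts the result by the key columns, while option 2 leaves d_sub in its original order and never builds a wide intermediate table.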

What is the most effective way to sort dataframe and add special id? [duplicate]

I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to @mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table, so .SD[,.I] returns the row numbers within each group. However, as @mnel points out, this is inefficient compared to the other method, as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
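# as.character() works for both character and factor columns;
# this assumes the rows of each group are contiguous
frame$z <- unlist(lapply(rle(as.character(frame[, "x"]))$lengths, seq_len))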
library(dplyr)
frame %>%
  group_by(x) %>%
  mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this, where x is the column by which grouping is to be done and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
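One difference between the answers is worth a quick sketch: ave assigns the counter back to the original row positions, so it does not need the rows of a group to be next to each other, whereas the rle and split/unlist versions assume the data is already grouped. The unsorted frame below is made up for illustration.

```r
# Made-up example where the groups are interleaved
frame <- data.frame(x = c("a", "b", "a", "b", "a"),
                    y = c(3, 2, 3, 2, 3))

# Count within each group of x; seq_along(frame$x) just supplies a vector of
# the right length, so no numeric column is required
frame$z <- ave(seq_along(frame$x), frame$x, FUN = seq_along)
frame
```

Here the three "a" rows get 1, 2, 3 and the two "b" rows get 1, 2, in their original positions.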

How to replace values in multiple columns in a data.frame with values from a vector in R?

I would like to replace the values in the last three columns in my data.frame with the three values in a vector.
Example of data.frame
df
A B C D
5 3 8 9
Vector
1 2 3
what I would like the data.frame to look like.
df
A B C D
5 1 2 3
Currently I am doing:
df$B <- Vector[1]
df$C <- Vector[2]
df$D <- Vector[3]
I would like to not replace the values one by one. I would like to do it all at once.
Any help will be appreciated. Please let me know if any further information is needed.
We can subset the last three columns of the dataset with tail, replicate the 'Vector' so that its length matches, and assign the values to those columns:
df[,tail(names(df),3)] <- Vector[col(df[,tail(names(df),3)])]
df
# A B C D
#1 5 1 2 3
NOTE: I replicated the 'Vector' assuming that there will be more rows in the 'df' in the original dataset.
Try this:
df[-1] <- 1:3
giving:
> df
A B C D
1 5 1 2 3
Alternately, we could do it non-destructively like this:
replace(df, -1, 1:3)
Note: The input df in reproducible form is:
df <- data.frame(A = 5, B =3, C = 8, D = 9)
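For a df with more than one row, as far as I can tell a plain vector on the right-hand side is recycled down the columns rather than across them, so df[-1] <- 1:3 would not put one value in each column. A sketch of one alternative is to assign a list, where each element is recycled down its own column. The two-row df and the name last3 are made up for illustration.

```r
# Made-up two-row version of the example data
df <- data.frame(A = c(5, 6), B = c(3, 4), C = c(8, 7), D = c(9, 0))
Vector <- c(1, 2, 3)

# Names of the last three columns
last3 <- tail(names(df), 3)

# One list element per column; each is recycled to the number of rows
df[last3] <- as.list(Vector)
df
```

This leaves A untouched and sets every row of B, C and D to 1, 2 and 3 respectively.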

In R, extract rows based on strings in different columns

Sorry if the solution to my problem is already out there, and I overlooked it. There are a lot of similar topics which all helped me understand the basics of what I'm trying to do, but did not quite solve my exact problem.
I have a data frame df:
> type = c("A","A","A","A","A","A","B","B","B","B","B","B")
> place = c("x","y","z","x","y","z","x","y","z","x","y","z")
> value = c(1:12)
>
> df=data.frame(type,place,value)
> df
type place value
1 A x 1
2 A y 2
3 A z 3
4 A x 4
5 A y 5
6 A z 6
7 B x 7
8 B y 8
9 B z 9
10 B x 10
11 B y 11
12 B z 12
>
(my real data has 3 different values in type and 10 in place, if that makes a difference)
I want to extract rows based on the strings in columns m and n.
E.g. I want to extract all rows that contain A in type and x and z in place, or all rows with A and B in type and y in place.
This works perfectly with subset, but I want to run my scripts on different combinations of extracted rows, and adjusting the subset command every time isn't very effective.
I thought of using a vector containing as elements what to get from type and place, respectively.
I tried:
v=c("A","x","z")
df.extract <- df[df$type&df$place %in% v]
but this returns an error.
I'm a total beginner with R and programming, so please bear with me.
You could try
df[df$type=='A' & df$place %in% c('x','y'),]
# type place value
#1 A x 1
#2 A y 2
#4 A x 4
#5 A y 5
For the second case
df[df$type %in% c('A', 'B') & df$place=='y',]
Update
Suppose you have many columns and need to subset the dataset based on values from several of them. For example:
set.seed(24)
df1 <- cbind(df, df[sample(1:nrow(df)),], df[sample(1:nrow(df)),])
colnames(df1) <- paste0(c('type', 'place', 'value'), rep(1:3, each=3))
row.names(df1) <- NULL
You can create a list of the values from the columns of interest
v1 <- setNames(list('A', 'x', c('A', 'B'),
'x', 'B', 'z'), paste0(c('type', 'place'), rep(1:3, each=2)))
and then use Reduce
df1[Reduce(`&`,Map(`%in%`, df1[names(v1)], v1)),]
You can make a function extract:
extract <- function(df, type, place) {
  df[df$type %in% type & df$place %in% place, ]
}
That will work for the different subsets you want to make:
df.extract<-extract(df=df,type="A",place=c("x","y")) # or just extract(df,"A",c("x","y"))
> df.extract
type place value
1 A x 1
2 A y 2
4 A x 4
5 A y 5
df.extract<-extract(df=df,type=c("A","B"),place="y") # or just extract(df,c("A","B"),"y")
> df.extract
type place value
2 A y 2
5 A y 5
8 B y 8
11 B y 11
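The Reduce/Map idea and the extract function can be combined into one helper that takes the allowed values as a named list, so the same call works for any set of columns. This is a sketch only; filter_by and allowed are hypothetical names, and it assumes every name in the list is a column of the data frame.

```r
# Hypothetical helper: keep rows where every listed column has one of the
# allowed values; `allowed` is a named list, one element per column
filter_by <- function(data, allowed) {
  keep <- Reduce(`&`, Map(`%in%`, data[names(allowed)], allowed))
  data[keep, , drop = FALSE]
}

df <- data.frame(type = rep(c("A", "B"), each = 6),
                 place = rep(c("x", "y", "z"), times = 4),
                 value = 1:12)

# All rows with A in type and x or z in place
res1 <- filter_by(df, list(type = "A", place = c("x", "z")))

# All rows with A or B in type and y in place
res2 <- filter_by(df, list(type = c("A", "B"), place = "y"))
```

The different combinations can then be kept in a list of such named lists and looped over, instead of editing a subset call each time.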
