I have an example data frame as shown below.
> x=data.frame(id=1:5,c1=letters[1:5],c2=letters[13:17])
> x
id c1 c2
1 1 a m
2 2 b n
3 3 c o
4 4 d p
5 5 e q
I want to create a vector out of this data frame which selects a different column for each row based on another vector. So if that vector is
> vars
[1] 1 2 2 1 1
I want c1 (the first data column) for the first row of x, c2 (the second data column) for the second row, and so on. So the expected output vector (or data frame) would be
if vector
a n o d e
if data frame
id V1
1 a
2 n
3 o
4 d
5 e
Any help, much appreciated.
You can 'slice' a data frame using a matrix:
y=data.frame(1:5,c(1,2,2,1,1))
x[2:3][as.matrix(y)]
result:
[1] "a" "n" "o" "d" "e"
Let's generalise this by creating a function
selector=function(x)matrix(c(seq_along(x),x),ncol=2)
Note that there is one column to be ignored at the start, so add 1 to your select vector v
v=c(1,2,2,1,1)
x[selector(v+1)]
result
[1] "a" "n" "o" "d" "e"
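The same idea can be spelled out step by step. This is only a sketch: the names cols, idx and picked are made up for illustration, and stringsAsFactors = FALSE is assumed so that the columns are plain character vectors. A two-column matrix used as an index selects one (row, column) cell per row of the matrix.

```r
x <- data.frame(id = 1:5, c1 = letters[1:5], c2 = letters[13:17],
                stringsAsFactors = FALSE)
vars <- c(1, 2, 2, 1, 1)

# Candidate columns; vars holds a position within this set (assumed names)
cols <- c("c1", "c2")

# One (row, column) pair per row of x
idx <- cbind(seq_len(nrow(x)), vars)

# Index the character matrix of candidate columns with the pair matrix
picked <- as.matrix(x[cols])[idx]

# The data frame form asked for in the question
result <- data.frame(id = x$id, V1 = picked, stringsAsFactors = FALSE)
```

Restricting to the candidate columns first means there is no need to add 1 to the selection vector.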
Related
I have a procedure that can extract items from a data frame based on a list of conditions on the columns (see Extracting items from an R data frame using criteria given as a (column_name = value) list):
Here are the data frame and condition list:
> experimental_plan_1
lib genotype treatment replicate
1 A WT normal 1
2 B WT hot 1
3 C mut normal 1
4 D mut hot 1
5 E WT normal 2
6 F WT hot 2
7 G mut normal 2
8 H mut hot 2
> condition_1 <- list(genotype="WT", treatment="normal")
My goal is to extract the values in the lib column for lines corresponding to criteria given in the list.
I can use the following function to extract the wanted values:
> get_libs <- function(experimental_plan, condition) {experimental_plan[apply((experimental_plan[, names(condition)] == condition), 1, all), "lib"]}
This works well with the above data frame:
> get_libs(experimental_plan_1, condition_1)
[1] A E
Levels: A B C D E F G H
However, I would like this to be more general: my experimental_plan and condition could have different columns:
> experimental_plan_2
lib genotype replicate
1 A WT 1
2 B WT 2
3 C WT 3
4 D mut 1
5 E mut 2
6 F mut 3
> condition_2 <- list(genotype="WT")
This time it fails:
> get_libs(experimental_plan_2, condition_2)
Error in apply((experimental_plan[, names(condition)] == condition), 1, :
dim(X) must have a positive length
In this case, the expected output should be:
[1] A B C
Levels: A B C D E F
How can I write a function that performs the same thing in a more robust manner?
Comment
I find it quite frustrating that the function does not work despite both cases being highly similar: both data frames have a lib column, and in both cases the names in the condition list correspond to column names in the data frame.
R apparently drops the data.frame down to a plain vector (here a factor) when only one column is extracted from it:
> class(experimental_plan_1)
[1] "data.frame"
> class(experimental_plan_2)
[1] "data.frame"
> class(names(condition_1))
[1] "character"
> class(names(condition_2))
[1] "character"
> class(experimental_plan_1[, names(condition_1)])
[1] "data.frame"
> class(experimental_plan_2[, names(condition_2)])
[1] "factor"
This goes against the principle of least surprise.
I would expect a computation to return the same type of output when the same type of inputs are given.
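The behaviour above comes from the default drop = TRUE of `[` on a data frame, which returns a bare vector when a single column is selected. One possible way around it, offered as a sketch rather than the definitive fix, is to avoid the column subset altogether and test one condition at a time. The helper names matches and keep are made up for illustration, and experimental_plan_2 is rebuilt here with character columns (an assumption; the original uses factors, which compare the same way).

```r
experimental_plan_2 <- data.frame(
  lib = c("A", "B", "C", "D", "E", "F"),
  genotype = c("WT", "WT", "WT", "mut", "mut", "mut"),
  replicate = c(1, 2, 3, 1, 2, 3),
  stringsAsFactors = FALSE
)
condition_2 <- list(genotype = "WT")

get_libs <- function(experimental_plan, condition) {
  # One logical vector per condition: does the named column equal the value?
  matches <- Map(function(col, value) experimental_plan[[col]] == value,
                 names(condition), condition)
  # Combine with AND; starting from all TRUE also handles an empty condition
  keep <- Reduce(`&`, matches, rep(TRUE, nrow(experimental_plan)))
  experimental_plan[keep, "lib"]
}

libs_2 <- get_libs(experimental_plan_2, condition_2)
```

Because each column is pulled out with [[ ]], it makes no difference whether the condition list names one column or several.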
I have a data.frame df as mentioned below
V1 V2
4 b c
14 g h
10 d g
6 b f
2 a e
5 b e
12 e f
1 a b
3 a f
9 c h
11 d h
7 c d
8 c g
13 f g
The first column holds the row names, so just ignore it. Now look at the columns V1 and V2. I want to find the unique elements present in them. In V1 the unique elements are b,g,d,a,e,c,f and in V2 they are c,h,g,f,e,b,d. Some of these elements are common to V1 and V2, i.e. b,g,d,e,c and f.
So I need to make a new data.frame with one column that lists all unique elements across both V1 and V2. By unique elements I mean elements that are present in V1, in V2, or in both, each listed only once in the new data.frame. The data.frame I want is shown below. It would be better if the list were sorted alphabetically (if the values are letters like a,b,c,d...) or in ascending order (if the elements are 1,2,3,...).
UniqueValues
a
b
c
d
e
f
g
h
Suppose this new data.frame is called UV1, and I have a similar one-column data.frame called UV2, which may have the same number of rows, more, or fewer (it is the result of the same operation on columns of a different data.frame). Can I compare UV1 and UV2, find the values that are in both data.frames and the values that are not in both, and save them in two different data.frames, say similarValuesdf and differentValuesdf?
I am a beginner, so I would prefer simpler code rather than 5-10 operations performed in a single statement, as I have seen in other replies. I understand that this saves time, and that the pros have gone through a lot to work out those one or two lines that do the whole thing, but I am just trying to learn, so I would really appreciate simpler code.
Thanks in advance.
I think the neatest way to do this would be to join the columns into a single vector and then use the unique() function:
unique(c(DF$V1,DF$V2))
[1] "b" "g" "d" "a" "e" "c" "f" "h"
To arrange the result in alphabetical order, sort() would be a good choice.
x <- unique(c(DF$V1,DF$V2))
sort(x)
[1] "a" "b" "c" "d" "e" "f" "g" "h"
Let's recreate your data:
DF <- read.table(text = " V1 V2
4 b c
14 g h
10 d g
6 b f
2 a e
5 b e
12 e f
1 a b
3 a f
9 c h
11 d h
7 c d
8 c g
13 f g", header = TRUE, stringsAsFactors = FALSE)
Unlist the two columns into one vector and find unique values in that vector:
u1 <- unique(unlist(DF[, c("V1", "V2")]))
sort(u1)
#[1] "a" "b" "c" "d" "e" "f" "g" "h"
A second vector:
u2 <- c("d", "e", "f")
Find the intersection:
intersect(u1, u2)
#[1] "d" "e" "f"
Find the set difference:
setdiff(u1, u2)
#[1] "b" "g" "a" "c" "h"
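To end up with the two data frames the question asks for, the set results can be wrapped back into one-column data frames. This is a sketch under assumptions: UV1 and UV2 are rebuilt here from the values shown above, and "different" is read as values found in only one of the two, which may or may not be what the asker intended.

```r
UV1 <- data.frame(UniqueValues = c("a", "b", "c", "d", "e", "f", "g", "h"),
                  stringsAsFactors = FALSE)
UV2 <- data.frame(UniqueValues = c("d", "e", "f"),
                  stringsAsFactors = FALSE)

# Values present in both data frames
common <- intersect(UV1$UniqueValues, UV2$UniqueValues)

# Values present in exactly one of them (assumed meaning of "different")
only_one <- union(setdiff(UV1$UniqueValues, UV2$UniqueValues),
                  setdiff(UV2$UniqueValues, UV1$UniqueValues))

similarValuesdf <- data.frame(UniqueValues = common, stringsAsFactors = FALSE)
differentValuesdf <- data.frame(UniqueValues = only_one, stringsAsFactors = FALSE)
```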
To get your unique values:
UniqueValues = sort(union(unique(df$V1), unique(df$V2)))
To get the intersection of two data.frame you can try:
df1 = data.frame(col1=c(1,4,6,8))
df2 = data.frame(col1=c(6,4,8,9))
similarValuesdf = merge(df1, df2)
# col1
#1 4
#2 6
#3 8
df_new = unique(append(df$V1, df$V2, after = length(df$V1)))
I have a data set look like this:
a<-c(1,1,1,2,2,2,3,3,3)
b<-rep(1:3,3)
c<-c(rep(c("i","j","k","l"),2),"o")
d<-data.frame(a,b,c)
which gives:
a b c
1 1 1 i
2 1 2 j
3 1 3 k
4 2 1 l
5 2 2 i
6 2 3 j
7 3 1 k
8 3 2 l
9 3 3 o
I am looking for a way to transform c into the following form:
1 2 3
1 i j k
2 l i j
3 k l o
So basically I hope to use a as the row index, b as column index, then transform the column c to a matrix. Is there any way this could be done efficiently by using data.table or other packages?
Thanks a lot guys!
@docendo's solution is pretty clean; you just have to make sure the data frame is sorted properly. Here is a slightly more generic version that uses a matrix index to create what you're after. It works if the data frame doesn't specify every value, if a value is specified more than once (the last value prevails), or if the data isn't sorted (although for that last case you can always sort first):
mx <- with(d, matrix(nrow=max(a), ncol=max(b)))
mx[as.matrix(d[1:2])] <- as.character(d$c)
mx
[,1] [,2] [,3]
[1,] "i" "j" "k"
[2,] "l" "i" "j"
[3,] "k" "l" "o"
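To illustrate the claim about unsorted data, here is a small sketch that shuffles the rows before filling the matrix. The row order in shuffled is arbitrary and chosen only for the demonstration, and the data frame is built with stringsAsFactors = FALSE (an assumption) so that the c column is already character.

```r
d <- data.frame(a = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                b = rep(1:3, 3),
                c = c(rep(c("i", "j", "k", "l"), 2), "o"),
                stringsAsFactors = FALSE)

# Arbitrary reordering of the rows, for demonstration only
shuffled <- d[c(9, 2, 7, 4, 1, 6, 3, 8, 5), ]

# Empty character matrix: rows indexed by a, columns by b
mx <- matrix(NA_character_, nrow = max(shuffled$a), ncol = max(shuffled$b))

# Each (a, b) pair names the cell that receives the matching c value
mx[cbind(shuffled$a, shuffled$b)] <- shuffled$c
```

Cells with no matching (a, b) pair would simply stay NA.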
As docendo said, you can just use c with a row-major layout:
matrix(c, 3, byrow=TRUE)
Or, if you really want to use the a and b vectors, I suggest the following trick:
mymatrix = matrix(NA_character_, 3, 3) # create an empty character matrix of the required size
for(i in seq_along(a)){mymatrix[a[i],b[i]]=c[i]} # fill in the matrix cell by cell
Given this table
df <- data.frame(col1 = c(letters[3:5], "b","a"),
col2 = c(2:3, 1,1,1))
How can I tell R to return "a"?
That is, of the three characters with a value of 1 (a tie for the lowest value), I want to select only the first in alphabetical order.
I think you want order:
with(df, col1[order(col2, col1)][1])
# [1] a
# Levels: a b c d e
or
as.character(with(df, col1[order(col2, col1)][1]))
# [1] "a"
You can order column 1 by the ordered values in column 2 with
df[with(df, order(col2, col1)),]
# col1 col2
# 5 a 1
# 4 b 1
# 3 e 1
# 1 c 2
# 2 d 3
Try:
> min(as.character(df[df$col2==min(df$col2),1]))
[1] "a"
For explanation:
# first find col1 list in rows with minimum of df$col2
> xx = df[df$col2==min(df$col2),1]
> xx
[1] e b a
Levels: a b c d e
# Now find the minimum amongst these after converting factor to character:
> min(as.character(xx))
[1] "a"
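The factor conversion is only needed because col1 was created as a factor. As a sketch, assuming the data frame is built with stringsAsFactors = FALSE (the names ties and first_tie are made up for illustration), the same two steps can work on characters directly:

```r
df <- data.frame(col1 = c(letters[3:5], "b", "a"),
                 col2 = c(2:3, 1, 1, 1),
                 stringsAsFactors = FALSE)

# All col1 values that share the lowest col2 value
ties <- df$col1[df$col2 == min(df$col2)]

# Alphabetically first among the tied values
first_tie <- sort(ties)[1]
```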
Say I have a data frame containing time-series data, where the first column is the index, and the remaining columns all contain different data streams, and are named descriptively, as in the following example:
temps = data.frame(matrix(1:20,nrow=2,ncol=10))
names(temps) <- c("flr1_dirN_areaA","flr1_dirS_areaA","flr1_dirN_areaB","flr1_dirS_areaB","flr2_dirN_areaA","flr2_dirS_areaA","flr2_dirN_areaB","flr2_dirS_areaB","flr3_dirN_areaA","flr3_dirS_areaA")
temps$Index <- as.Date(2013,7,1:2)
temps
flr1_dirN_areaA flr1_dirS_areaA ... Index
1 1 3 ... 1975-07-15
2 2 4 ... 1975-07-16
Now I want to prep the data frame for plotting with ggplot2, and I want to include the three factors: flr, dir, and area.
I can achieve this for this simple example as follows:
temps.m <- melt(temps,"Index")
temps.m$flr <- factor(rep(1:3,c(8,8,4)))
temps.m$dir <- factor(rep(c("N","S"),each=2,len=20))
temps.m$area <- factor(rep(c("A","B"),each=4,len=20))
temps.m
Index variable value flr dir area
1 1975-07-15 flr1_dirN_areaA 1 1 N A
2 1975-07-16 flr1_dirN_areaA 2 1 N A
3 1975-07-15 flr1_dirS_areaA 3 1 S A
4 1975-07-16 flr1_dirS_areaA 4 1 S A
5 1975-07-15 flr1_dirN_areaB 5 1 N B
6 1975-07-16 flr1_dirN_areaB 6 1 N B
7 1975-07-15 flr1_dirS_areaB 7 1 S B
8 1975-07-16 flr1_dirS_areaB 8 1 S B
9 1975-07-15 flr2_dirN_areaA 9 2 N A
10 1975-07-16 flr2_dirN_areaA 10 2 N A
11 1975-07-15 flr2_dirS_areaA 11 2 S A
12 1975-07-16 flr2_dirS_areaA 12 2 S A
13 1975-07-15 flr2_dirN_areaB 13 2 N B
14 1975-07-16 flr2_dirN_areaB 14 2 N B
15 1975-07-15 flr2_dirS_areaB 15 2 S B
16 1975-07-16 flr2_dirS_areaB 16 2 S B
17 1975-07-15 flr3_dirN_areaA 17 3 N A
18 1975-07-16 flr3_dirN_areaA 18 3 N A
19 1975-07-15 flr3_dirS_areaA 19 3 S A
20 1975-07-16 flr3_dirS_areaA 20 3 S A
In reality, I have data streams (columns) of varying lengths - each of which comes from its own file, has missing data, more than 3 factors encoded in the column (file) names, so this simple method of applying factors won't work. I need something more robust, and I'm inclined to parse the variable names into the different factors, and populate the factor-columns of the melted data frame.
My end goal is to plot something like this:
ggplot(temps.m,aes(x=Index,y=value,color=area,linetype=dir))+geom_line()+facet_grid(flr~.)
I imagine that the reshape, reshape2, plyr, or some other package can do this in one or two statements - but I struggle with melt/cast/ddply and the rest of them. Any suggestions?
Also, if you can suggest an entirely different [better] approach to structuring my data, I'm all ears.
Thanks in advance
You can use a regular expression to create your factors:
res <- do.call(rbind,strsplit(gsub('flr([0-9]+).*dir([A-Z]).*area([A-Z])',
'\\1,\\2,\\3',
temps.m$variable),
','))
[,1] [,2] [,3]
[1,] "1" "N" "A"
[2,] "1" "N" "A"
[3,] "1" "S" "A"
[4,] "1" "S" "A"
[5,] "1" "N" "B"
[6,] "1" "N" "B"
[7,] "1" "S" "B"
[8,] "1" "S" "B"
........
You may need a further step to convert the columns to factors (colwise comes from the plyr package):
res <- colwise(as.factor)(data.frame(res))
X1 X2 X3
1 1 N A
2 1 N A
3 1 S A
4 1 S A
........
To combine the result with your melted data, you can use cbind:
temps.m <- cbind(temps.m,res)
Here's a way to turn a bunch of appropriately-formatted strings into a data frame of factor variables. This assumes the factors are split by _, and the last character in each substring is the desired level.
require(plyr)
v <- do.call(rbind, strsplit(as.character(temps.m$variable), "_"))
v <- alply(v, 2, function(x) {
n <- nchar(x)
name <- substr(x, 1, n - 1)[1]
lev <- substr(x, n, n)
structure(factor(lev), name=name)
})
names(v) <- sapply(v, attr, "name")
temps.m <- cbind(temps.m, as.data.frame(v))
Adding more generality is left as an exercise for the reader.
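One possible direction for that exercise, in base R only, is sketched below. It assumes every piece of a variable name is a run of lowercase letters (the factor name) followed by the level, which may be longer than one character; the sample names in variable, including flr10, are invented for illustration.

```r
variable <- c("flr1_dirN_areaA", "flr2_dirS_areaB", "flr10_dirN_areaA")

# One row per variable name, one column per "_"-separated piece
parts <- do.call(rbind, strsplit(variable, "_", fixed = TRUE))

# Factor names: the leading lowercase letters of each piece in the first row
factor_names <- sub("^([a-z]+).*$", "\\1", parts[1, ])

# Levels: whatever follows the lowercase prefix; rebuild the matrix shape
levels_only <- matrix(sub("^[a-z]+", "", parts), nrow = nrow(parts))

# One factor column per piece, named after its prefix
parsed <- as.data.frame(levels_only, stringsAsFactors = TRUE)
names(parsed) <- factor_names
```

The resulting parsed data frame could then be combined with the melted data using cbind, as in the answers above, with temps.m$variable converted by as.character() first.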