Populate data frame based on lookup data frame in R

How does one go about filling a data frame from a second data frame, using a lookup table that maps the column names of one onto the column names of the other?
Orig
A B C
1 2 3
2 2 2
4 5 6
Ret
D E
7 8
8 9
2 4
lookup <- data.frame(Orig=c('A','B','C'),Ret=c('D','D','E'))
Orig Ret
1 A D
2 B D
3 C E
So that the final data frame would be
A B C
7 7 8
8 8 9
2 2 4

We can match the column names of 'Orig' against the 'Orig' column in 'lookup' to find, for each column, the index of its row in 'lookup' (here the order is the same, but in other cases it could differ), and get the corresponding 'Ret' elements based on that. We use those to subset the 'Ret' dataset and assign the output back to the original dataset. Here I made a copy of "Orig".
OrigN <- Orig
OrigN[] <- Ret[as.character(lookup$Ret[match(colnames(Orig),
                                             as.character(lookup$Orig))])]
OrigN
# A B C
#1 7 7 8
#2 8 8 9
#3 2 2 4
NOTE: as.character was used because the columns in 'lookup' were of factor class (the default for character columns in data.frame() before R 4.0.0).

I believe that the following will work as well.
OrigN <- Orig
OrigN[, as.character(lookup$Orig)] <- Ret[, as.character(lookup$Ret)]
This method selects the columns of Orig (actually of a copy, OrigN, following @akrun) in the order given by the lookup and then fills them with the correspondingly ordered columns of Ret.
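To check that the lookup is applied by name rather than by position, here is a minimal sketch using the data from the question, with the rows of lookup deliberately shuffled. The shuffled order and the helper name ret_cols are illustrative assumptions, not part of the original answers.

```r
Orig <- data.frame(A = c(1, 2, 4), B = c(2, 2, 5), C = c(3, 2, 6))
Ret  <- data.frame(D = c(7, 8, 2), E = c(8, 9, 4))

# Same mapping as in the question (A -> D, B -> D, C -> E), rows shuffled
lookup <- data.frame(Orig = c("C", "A", "B"),
                     Ret  = c("E", "D", "D"),
                     stringsAsFactors = FALSE)

# For each column of Orig, find its row in lookup, then take that row's Ret name
ret_cols <- lookup$Ret[match(colnames(Orig), lookup$Orig)]

OrigN <- Orig
OrigN[] <- Ret[ret_cols]   # OrigN[] keeps the column names A, B, C
OrigN
```

Because the columns are located with match(), the result should not depend on the row order of lookup.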

Related

Split a data frame by rows and save as csv

I just have a data frame and want to split the data frame by rows, assign the several new data frames to new variables and save them as csv files.
a <- rep(1:5, each = 3)
b <- rep(1:3, each = 5)
c <- data.frame(a, b)
#    a b
# 1  1 1
# 2  1 1
# 3  1 1
# 4  2 1
# 5  2 1
# 6  2 2
# 7  3 2
# 8  3 2
# 9  3 2
# 10 4 2
# 11 4 3
# 12 4 3
# 13 5 3
# 14 5 3
# 15 5 3
I want to split c by column a, i.e. all rows where column a is 1 are split off from c, assigned to A, and saved as A.csv. The same goes for B.csv with all rows where column a is 2.
What I can do is
A<-c[c$a%in%1,]
write.csv (A, "A.csv")
B<-c[c$a%in%2,]
write.csv (B, "B.csv")
...
If I have 1000 rows there will be lots of subsets, so I wonder if there is a simple way to do this with a for loop?
The split() function is very useful for splitting a data frame. You can also use lapply() here, which is more compact than an explicit loop.
dfs <- split(c, c$a) # list of dfs
# use numbers as file names
lapply(names(dfs),
       function(x){write.csv(dfs[[x]], paste0(x, ".csv"),
                             row.names = FALSE)})
# or use letters (max 26!) as file names
names(dfs) <- LETTERS[seq_along(dfs)]
lapply(names(dfs),
       function(x){write.csv(dfs[[x]],
                             file = paste0(x, ".csv"),
                             row.names = FALSE)})
vals <- unique(c$a)  # loop over the values actually present in column a
for (i in seq_along(vals)) {
  write.csv(c[c$a == vals[i], ], paste0(LETTERS[i], ".csv"))
}
You should consider, however, what happens if you have more than 26 subsets. What will those files be named?
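One way around the 26-letter limit is to build the file names from the group values themselves. The following is only a sketch: the "group_" prefix, the name dat, and writing to tempdir() are assumptions made for illustration.

```r
a <- rep(1:5, each = 3)
b <- rep(1:3, each = 5)
dat <- data.frame(a, b)   # called `c` in the question; renamed to avoid confusion with base::c

dfs <- split(dat, dat$a)  # named list, one data frame per value of a

out_dir <- tempdir()      # assumption: write somewhere disposable
files <- file.path(out_dir, paste0("group_", names(dfs), ".csv"))

# Write each piece to its own file; invisible() suppresses the printed list of NULLs
invisible(Map(function(d, f) write.csv(d, f, row.names = FALSE), dfs, files))

basename(files)
```

Since the names come from split(), this works for any number of groups, as long as the group values are valid in file names.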

data frame column names no longer unique when subsetting

I have a data frame that contains duplicate column names. I'm aware that it's non-standard to use duplicated column names, but these names are actually being reassigned downstream using user inputs. For now, I'm attempting to positionally subset a data frame, but the column names become deduplicated. Here's an example.
> df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2+(2:5)), check.names = F)
> df
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
4 4 5 E 7
However, when I attempt to subset, the names change...
> df[, 1:3]
x y y.1
1 1 2 B
2 2 3 C
3 3 4 D
4 4 5 E
Is there any way to prevent this from happening? It only occurs when I subset on columns, not rows.
> df[1:3,]
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
Edit for others noticing this behavior:
I've done some digging into the behavior and found the relevant section in the help page for Extract.data.frame (type ?"[.data.frame"). It states:
If [ returns a data frame it will have unique (and non-missing) row
names, if necessary transforming the row names using make.unique.
Similarly, if columns are selected column names will be transformed to
be unique if necessary (e.g., if columns are selected more than once,
or if more than one column of a given name is selected if the data
frame has duplicate column names).
This explains the why; I appreciate the comments so far on how best to work around it.
Here is an option, although I think it is not a good idea to have duplicated column names.
as.data.frame(as.list(df)[1:3], check.names = F)
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E
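Another possibility, offered as a sketch rather than a recommendation, is to subset as usual and then copy the original names back onto the result. The names cols and sub are hypothetical.

```r
df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = 2 + (2:5),
                 check.names = FALSE)

cols <- 1:3
sub <- df[, cols]               # names are made unique here: x, y, y.1
names(sub) <- names(df)[cols]   # put the duplicated names back

names(sub)
```

This relies on names<- not checking for duplicates, so any later column subset of sub would make the names unique again.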

cbind named vectors in R by name

I have two named vectors similar to these ones:
x <- c(1:5)
names(x) <- c("a","b","c","d","e")
t <- c(6:10)
names(t) <- c("e","d","c","b","a")
I would like to combine them so to get the following outcome:
x t
a 1 10
b 2 9
c 3 8
d 4 7
e 5 6
Unfortunately, when I run cbind(x,t) the result just combines them in their existing order, disregarding the names of t and keeping only those of x, giving the following result:
x t
a 1 6
b 2 7
c 3 8
d 4 9
e 5 10
I'm pretty sure there must be an easy solution, but I cannot find it. As this step is part of a long and tedious loop (and the vectors I'm working with are much longer), it is important to have the simplest and fastest option.
We can use the names of 'x' to change the order of the 't' elements and cbind with 'x'
cbind(x, t = t[names(x)])
# x t
#a 1 10
#b 2 9
#c 3 8
#d 4 7
#e 5 6
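One thing to keep in mind with name-based indexing is that a name missing from t yields NA rather than an error. A small sketch, where t2 is a hypothetical vector lacking the element "e":

```r
x <- 1:5
names(x) <- c("a", "b", "c", "d", "e")
t <- 6:10
names(t) <- c("e", "d", "c", "b", "a")

res <- cbind(x, t = t[names(x)])   # t reordered to follow the names of x
res

# Hypothetical case: t2 has no element named "e", so that row gets NA
t2 <- t[c("d", "c", "b", "a")]
res2 <- cbind(x, t2 = t2[names(x)])
res2
```

If the two vectors are not guaranteed to share the same names, checking for NA in the result (or comparing the name sets beforehand) may be worthwhile.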

Finding the top values in data frame using r

How can I find the 5 highest values of a column in a data frame?
I tried the order() function but it gives me only the indices of the rows, whereas I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame(column = 1:10,
                 names = letters[1:10])
order(DF$column)
# [1]  1  2  3  4  5  6  7  8  9 10
tail(DF[order(DF$column), ], 5)
#    column names
# 6       6     f
# 7       7     g
# 8       8     h
# 9       9     i
# 10     10     j
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
Also, as mentioned in the comments, you can use head instead of tail with decreasing = TRUE if you'd like, but that's a matter of taste.
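As a sketch of the head() variant on unsorted data (the values below are made up for illustration, and top5 is a hypothetical name):

```r
DF <- data.frame(column = c(3, 10, 7, 1, 9, 4, 8), names = letters[1:7])

# Rows holding the 5 largest values, largest first
top5 <- head(DF[order(DF$column, decreasing = TRUE), ], 5)
top5

# If only the values are needed, sort() is enough
head(sort(DF$column, decreasing = TRUE), 5)
```

Indexing the whole data frame keeps the other columns alongside the values, which sort() on the column alone does not.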

R combine nx4 into nx2

I have a dataset with one factor (4 levels). However, the data for each factor level are currently in their own column, with the level label at the top (a matrix of n by 4).
To do an ANOVA I want to change this to an n by 2 layout, with all the factor labels in column A and all the data in column B.
I could easily cut and paste this in Excel and then save it back to a csv, but I assume there is a way to do this with cbind.
Sample data:
A B C D
2 4 6 8
3 5 7 9
What I require:
A 2
A 3
B 4
B 5
C 6
C 7
D 8
D 9
You should use stack:
stack(df) # where `df` is your data.frame
stack is better here but also:
library(reshape2)
melt(df)
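For the sample data in the question, a minimal sketch of the stack() route might look like this. The column reordering and the aov() call are assumptions about what the asker wants to do next; long and fit are hypothetical names.

```r
df <- data.frame(A = c(2, 3), B = c(4, 5), C = c(6, 7), D = c(8, 9))

long <- stack(df)                    # columns: values, ind (a factor)
long <- long[, c("ind", "values")]   # put the factor label first
long

# The long format can be passed to aov() directly
fit <- aov(values ~ ind, data = long)
```

stack() names its output columns values and ind, so they may need renaming to match the rest of an analysis.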
