R: Perform operations on columns and place the result in a different column, with the operation specified by the output column's name

I have a data frame with 3 columns of data (L1, L2, L3) and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc., one for each combination of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking of using match to find the appropriate original columns, and a for loop to iterate over all of the columns in this search.
So if the column I am attempting to fill is L1+L2, I would have something like:
apply(dataframe[, c(i, j)], 1, sum)

It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3

dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to pass the data frame as the envir argument of eval(parse(text=...)) rather than wrap the call in with, since programmatic use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., envir=dfrm) form should be any safer, though.
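If you would rather not evaluate column names as code, a minimal sketch along the lines the question suggests (matching names back to the original columns) is to split each name into two operands and an operator. It assumes every target name has the form <column><op><column> with a single +, -, * or / operator; fill_by_name is a hypothetical helper name, not an existing function.

```r
# Sample data in the same shape as the first answer above
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)), c("L1", "L2", "L3"))
targets <- c("L1+L2", "L2+L3", "L1-L3")
mydf[targets] <- NA

# Hypothetical helper: fill each target column from the operation in its name
fill_by_name <- function(df, targets) {
  pattern <- "^([[:alnum:]_.]+)([-+*/])([[:alnum:]_.]+)$"
  for (nm in targets) {
    # parts = full match, left operand, operator, right operand
    parts <- regmatches(nm, regexec(pattern, nm))[[1]]
    if (length(parts) != 4) stop("cannot parse column name: ", nm)
    if (!all(parts[c(2, 4)] %in% names(df))) stop("unknown column in: ", nm)
    op <- match.fun(parts[3])  # look up `+`, `-`, `*` or `/`
    df[[nm]] <- op(df[[parts[2]]], df[[parts[4]]])
  }
  df
}

mydf <- fill_by_name(mydf, targets)
```

Because the operator is looked up from a fixed set, a malformed column name raises an error instead of being executed.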

Related

Remove NA's by keeping all the populated cells in new columns using R

How can I drop all the elements with missing values but instead of deleting entire columns, create columns with just the populated cells? For example getting from this
A B C D
1 NA 2 NA
NA 3 NA 4
NA 5 6 NA
(data1) in order to create a data-set containing only the populated cells, as this
AB BB
1 2
3 4
5 6
Below I have created a small working example to test a solution.
# Create example dataset (data1)
data1 <- data.frame(matrix(c(1,NA,2,NA,NA,3,NA,4,NA,5,6,NA),nrow = 3, byrow = T))
colnames(data1) <- c("A","B","C","D")
print(data1)
A B C D
1 NA 2 NA
NA 3 NA 4
NA 5 6 NA
# Create new dataset?
Here is a potential solution using akrun's/Valentin's answer from this question.
Let's say the data is
data1 <- data.frame(matrix(c(1,NA,2,NA,NA,3,NA,4,NA,5,NA,NA),nrow = 3, byrow = T))
> data1
X1 X2 X3 X4
1 1 NA 2 NA
2 NA 3 NA 4
3 NA 5 NA NA
Then use
rows <- apply(data1, 1, function(x) x[!is.na(x)]) # non-NA values of each row
df1 <- t(sapply(rows, "length<-", max(lengths(rows)))) # pad to the longest row
to arrive at
> df1
X1 X3
[1,] 1 2
[2,] 3 4
[3,] 5 NA
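One caveat: apply() returns a matrix rather than a list when every row keeps the same number of values, which is the case for the data in the question. A sketch that avoids this by building the list of rows explicitly (the names rows, width and compact are illustrative):

```r
# Data from the question: every row has exactly two populated cells
data1 <- data.frame(matrix(c(1, NA, 2, NA, NA, 3, NA, 4, NA, 5, 6, NA),
                           nrow = 3, byrow = TRUE))

# Collect the non-NA values of each row; lapply always returns a list
rows <- lapply(seq_len(nrow(data1)), function(i) {
  x <- unlist(data1[i, ], use.names = FALSE)
  x[!is.na(x)]
})

# Pad every row with NA to the length of the longest one and stack them
width <- max(lengths(rows))
compact <- do.call(rbind, lapply(rows, `length<-`, width))
compact
```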

Replace values within a range in a data frame in R

I have ranked the rows of a data frame based on the values in each column, with ranks from 1 to 10.
I have code that replaces values with NA or 1. But I can't figure out how to replace a range of numbers, e.g. 3-6, with 1 and then replace the rest (1-2 and 7-10) with NA.
lag.rank <- as.matrix(lag.rank)
lag.rank[lag.rank > n] <- NA
lag.rank[lag.rank <= n] <- 1
At the moment it only replaces numbers above or under n. Any suggestions? I figure it should be fairly simple?
Is this what you are trying to accomplish?
> x <- sample(1:10,20, TRUE)
> x
[1] 1 2 8 2 6 4 9 1 4 8 6 1 2 5 8 6 9 4 7 6
> x <- ifelse(x %in% c(3:6), 1, NA)
> x
[1] NA NA NA NA 1 1 NA NA 1 NA 1 NA NA 1 NA 1 NA 1 NA 1
If your data aren't integers but numeric you can use between from the dplyr package:
x <- ifelse(between(x,3,6), 1, NA)
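The same idea can be applied directly to the matrix from the question with a logical mask, in base R only. This is a sketch on made-up ranks; lo, hi and in_range are illustrative names, and the mask is computed once so the second assignment does not see the values changed by the first.

```r
# Made-up rank matrix standing in for lag.rank
set.seed(1)
lag.rank <- matrix(sample(1:10, 20, replace = TRUE), nrow = 5)
ranks <- lag.rank            # keep a copy of the original ranks

lo <- 3
hi <- 6
in_range <- lag.rank >= lo & lag.rank <= hi  # TRUE where the rank is 3-6

lag.rank[in_range]  <- 1     # ranks inside the range become 1
lag.rank[!in_range] <- NA    # everything else becomes NA
```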

Find all indices of duplicates and write them in new columns

I have a data.frame with a single column, a vector of strings.
These strings have duplicate values.
I want to find the character strings that have duplicates in this vector and write their index of position in a new column.
So for example consider I have:
DT <- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
I want to get:
string match1 match2 match3 matchx....
A 1 7 11
B 2 NA NA
C 3 8 NA
D 4 NA NA
E 5 NA NA
F 6 9 NA
A 1 7 11
C 3 8 NA
F 6 9 NA
Z 10 NA NA
A 1 7 11
The vector is much longer than in this example, and I do not know the maximum number of columns I will need.
What will be the most effective way to do this?
I know that there is the duplicated function, but I am not exactly sure how to use it to get the result I want here.
Many thanks!
Here's one way of doing this. I'm sure a data.table one liner follows.
DT<- data.frame(string=c("A","B","C","D","E","F","A","C","F","Z","A"))
# find matches
rbf <- sapply(DT$string, FUN = function(x, DT) which(DT %in% x), DT = DT$string)
# fill in NAs to have a pretty matrix
out <- sapply(rbf, FUN = function(x, mx) c(x, rep(NA, length.out = mx - length(x))), max(sapply(rbf, length)))
# bind it to the original data
cbind(DT, t(out))
string 1 2 3
1 A 1 7 11
2 B 2 NA NA
3 C 3 8 NA
4 D 4 NA NA
5 E 5 NA NA
6 F 6 9 NA
7 A 1 7 11
8 C 3 8 NA
9 F 6 9 NA
10 Z 10 NA NA
11 A 1 7 11
Here is one option with data.table. After grouping by 'string', get the sequence (seq_len(.N)) and row index (.I), then dcast to 'wide' format and join with the original dataset on the 'string'
library(data.table)
dcast(setDT(DT)[, .(seq_len(.N),.I), string],string ~ paste0("match", V1))[DT, on = "string"]
# string match1 match2 match3
# 1: A 1 7 11
# 2: B 2 NA NA
# 3: C 3 8 NA
# 4: D 4 NA NA
# 5: E 5 NA NA
# 6: F 6 9 NA
# 7: A 1 7 11
# 8: C 3 8 NA
# 9: F 6 9 NA
#10: Z 10 NA NA
#11: A 1 7 11
Or another option would be to split the sequence of rows by 'string', pad the shorter list elements with NA up to the maximum length, and merge with the original dataset (using base R methods)
lst <- split(seq_len(nrow(DT)), DT$string)
merge(DT, do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))),
by.x = "string", by.y = "row.names")
data
DT<- data.frame(string=c("A","B","C","D","E","F","A","C",
"F","Z","A"), stringsAsFactors=FALSE)
And here's one that uses tidyverse tools ( not quite a one-liner ;) ):
library( tidyverse )
DT %>% group_by( string ) %>%
do( idx = which(DT$string == unique(.$string)) ) %>%
ungroup %>% unnest %>% group_by( string ) %>%
mutate( m = stringr::str_c( "match", 1:n() ) ) %>%
spread( m, idx )
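Note that merge() sorts its result by the key by default, so the rows come back grouped by string rather than in their original order. A base R sketch that keeps the original row order by looking each string up in the padded index table (idx, width, padded and out are illustrative names):

```r
DT <- data.frame(string = c("A", "B", "C", "D", "E", "F", "A", "C", "F", "Z", "A"),
                 stringsAsFactors = FALSE)

# Row positions of every distinct string
idx <- split(seq_len(nrow(DT)), DT$string)

# Pad each index vector with NA and stack; row names are the distinct strings
width <- max(lengths(idx))
padded <- do.call(rbind, lapply(idx, `length<-`, width))
colnames(padded) <- paste0("match", seq_len(width))

# Look up each row's string in the table, preserving the original order
m <- padded[DT$string, , drop = FALSE]
rownames(m) <- NULL
out <- cbind(DT, m)
out
```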

Turn different sized rows into columns

I am reading in a data file with many different rows, all of which can have different lengths like so:
dataFile <- read.table("file.txt", as.is=TRUE);
The rows can be as follows:
1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2
I need the rows to be transformed into columns. I'll be then using the columns for a violin plot like so:
names(dataCol)[1] <- "x";
jpeg("violinplot.jpg", width = 1000, height = 1000);
do.call(vioplot,c(dataCol,))
dev.off()
I'm assuming there will be an empty string/placeholder for any column with fewer entries than the column with the maximum number of entries. How can it be done?
Use the fill = TRUE argument in read.table. Then to change rows to columns, use t to transpose. Using your data this would look like...
df <- read.table( text = "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2
" , header = FALSE , fill = TRUE )
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1 1 5 2 6 2 1 NA NA NA
#2 2 6 24 NA NA NA NA NA NA
#3 2 6 1 5 2 7 982 24 6
#4 25 2 NA NA NA NA NA NA NA
t(df)
# [,1] [,2] [,3] [,4]
#V1 1 2 2 25
#V2 5 6 6 2
#V3 2 24 1 NA
#V4 6 NA 5 NA
#V5 2 NA 2 NA
#V6 1 NA 7 NA
#V7 NA NA 982 NA
#V8 NA NA 24 NA
#V9 NA NA 6 NA
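One thing to watch with fill = TRUE: read.table decides the number of columns from the first five lines of input unless col.names says otherwise, so a longer row further down a big file may not get the columns it needs. A sketch that counts the fields first with count.fields, then drops the NA padding again to get one vector per original row (txt, n and cols are illustrative names):

```r
txt <- "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2"

# Widest row in the input; for a file, pass the file name to count.fields
n <- max(count.fields(textConnection(txt)))

# Supplying col.names forces read.table to allocate n columns
df <- read.table(text = txt, header = FALSE, fill = TRUE,
                 col.names = paste0("V", seq_len(n)))

# One numeric vector per original row, with the NA padding removed
cols <- lapply(as.data.frame(t(df)), function(x) x[!is.na(x)])
```

A list of plain vectors like cols is usually what plotting functions expect, since the NA padding is gone.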
EDIT: apparently read.table has a fill=TRUE option, which is WAYYYY easier than my answer.
I've never used vioplot before, and that seems like a weird way to make a function call (instead of something like vioplot(dataCol)), but I have worked with ragged arrays before, so I'll try that.
Have you read the data in yet? That tends to be the hardest part. The code below reads the above data from a file called temp.txt into a matrix called out2
file = 'temp.txt'
dat = readChar(file,file.info(file)$size)
split1 = strsplit(dat,"\n")
split2 = strsplit(split1[[1]]," ")
n = max(unlist(lapply(split2,length)))
tFun = function(i){
vect = as.numeric(split2[[i]])
length(vect)=n # pad with NA up to the length of the longest row
vect # return the padded vector
}
out2 = sapply(1:length(split2),tFun)
I'll try and explain what I've done: the first step is to read in every character via readChar. You then split the lines, then the elements within each line to get the list split2, where each element of the list is a row of the input file.
From there you pad each row with NA up to the length of the longest row, and sapply binds the padded vectors together as the columns of the matrix out2.
It's not pretty, but it works!
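The same steps can be sketched more compactly with readLines, which already splits the input into lines. The example reads from a text connection so that it is self-contained; for a real file, readLines("temp.txt") would take its place. It assumes the values are separated by whitespace.

```r
txt <- "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2"

# One character string per line of input
lines <- readLines(textConnection(txt))

# Split each line on whitespace and convert to numbers
rows <- lapply(strsplit(trimws(lines), "\\s+"), as.numeric)

# Pad every row with NA to the longest length; sapply binds them as columns
n <- max(lengths(rows))
out2 <- sapply(rows, `length<-`, n)
```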

How does one merge dataframes by row name without adding a "Row.names" column?

If I have two data frames, such as:
df1 = data.frame(x=1:3,y=1:3,row.names=c('r1','r2','r3'))
df2 = data.frame(z=5:7,row.names=c('r5','r6','r7'))
(
R> df1
x y
r1 1 1
r2 2 2
r3 3 3
R> df2
z
r5 5
r6 6
r7 7
), I'd like to merge them by row names, keeping everything (so an outer join, or all=T). This does it:
merged.df <- merge(df1,df2,all=T,by='row.names')
R> merged.df
Row.names x y z
1 r1 1 1 NA
2 r2 2 2 NA
3 r3 3 3 NA
4 r5 NA NA 5
5 r6 NA NA 6
6 r7 NA NA 7
but I want the input row names to be the row names in the output dataframe (merged.df).
I can do:
rownames(merged.df) <- merged.df[[1]]
merged.df <- merged.df[-1]
which works, but seems inelegant and hard to remember. Anyone know of a cleaner way?
Not sure if it's any easier to remember, but you can do it all in one step using transform.
transform(merge(df1,df2,by=0,all=TRUE), row.names=Row.names, Row.names=NULL)
# x y z
#r1 1 1 NA
#r2 2 2 NA
#r3 3 3 NA
#r5 NA NA 5
#r6 NA NA 6
#r7 NA NA 7
From the help of merge:
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
So it is clear that you can't avoid the Row.names column at least using merge. But maybe to remove this column you can subset by name and not by index. For example:
dd <- merge(df1,df2,by=0,all=TRUE) ## by=0 easier to write than row.names ,
## TRUE is cleaner than T
Then I use row.names to subset like this :
res <- subset(dd,select=-c(Row.names))
rownames(res) <- dd[,'Row.names']
x y z
r1 1 1 NA
r2 2 2 NA
r3 3 3 NA
r5 NA NA 5
r6 NA NA 6
r7 NA NA 7
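If the pattern comes up often, one option is to wrap it in a small function so there is nothing to remember. This is only a sketch; merge_by_rownames is a hypothetical name, and extra arguments such as all = TRUE are passed straight through to merge.

```r
# Hypothetical wrapper: merge on row names and restore them afterwards
merge_by_rownames <- function(x, y, ...) {
  m <- merge(x, y, by = "row.names", ...)
  rownames(m) <- as.character(m[["Row.names"]])
  m[["Row.names"]] <- NULL  # drop the helper column added by merge
  m
}

df1 <- data.frame(x = 1:3, y = 1:3, row.names = c("r1", "r2", "r3"))
df2 <- data.frame(z = 5:7, row.names = c("r5", "r6", "r7"))

merged.df <- merge_by_rownames(df1, df2, all = TRUE)
merged.df
```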
