How do I create a function to select every second value in a column of a data frame in R, starting from the second value in the column?
I tried something like this:
df.new = df[seq(1, nrow(df), 2), ]
You can use c(FALSE, TRUE) to subset the data.frame and get every second row starting with the second.
x[c(FALSE, TRUE),]
# a b
#2 2 9
#4 4 7
#6 6 5
#8 8 3
#10 10 1
And for a specific column:
x$a[c(FALSE, TRUE)]
#[1] 2 4 6 8 10
Data
x <- data.frame(a = 1:10, b=10:1)
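Since the question asks for a function, the same idea can be wrapped up as a minimal sketch. The helper name every_nth and its arguments n and start are made up for illustration; they are not from any package.

```r
# Hypothetical helper: keep every n-th element of a vector, beginning at `start`
every_nth <- function(x, n = 2, start = 2) {
  # Guard: seq() would fail if `start` lies beyond the end of the vector
  if (start > length(x)) return(x[0])
  x[seq(start, length(x), by = n)]
}

x <- data.frame(a = 1:10, b = 10:1)
second_a <- every_nth(x$a)                    # every second value, from the second
third_b  <- every_nth(x$b, n = 3, start = 1)  # every third value, from the first
```

Unlike the recycled c(FALSE, TRUE) index, this version lets you change the step and the starting position independently.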
I have three different data frames that share the same columns, such as:
df1 df2 df3
Class 1 2 3 Class 1 2 3 Class 1 2 3
A 5 3 2 A 7 3 10 A 5 4 1
B 9 1 4 B 2 6 2 A 2 6 2
C 7 9 8 C 4 7 1 A 12 3 8
I would like to iterate through the three files and select the data from the columns with the same name. In other words, I want to iterate three times, each time selecting the data of column 1, then column 2, and then column 3, and merging them into one data frame.
To do that, I did the following:
df1 <- read.csv(R1)
df2 <- read.csv(R2)
df3 <- read.csv(R3)
df <- data.frame(Class=character(), B1_1=integer(), B1_2=integer(), B1_3=integer(), stringsAsFactors=FALSE)
for(i in 1:3){
nam <- paste("X", i, sep = "") #here I want to call the column name such as X1, X2, and X3
df[seq_along(df1[nam]), ]$B1_1 <- df1[nam]
df[seq_along(df2[nam]), ]$B1_2 <- df2[nam]
df[seq_along(df3[nam]), ]$B1_3 <- df3[nam]
df$Class <- df1$Class
}
In this line, df[seq_along(df1[nam]), ]$B1_1 <- df1[nam], I followed the solution from a related answer, but it produces the following error:
Error in `$<-.data.frame`(`*tmp*`, "B1_1", value = list(X1 = c(5L, 7L, :
replacement has 10 rows, data has 1
Do you have any idea how to solve it?
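One likely reading of the error, offered as a sketch rather than a definitive fix: df1[nam] is a one-column data frame, so seq_along(df1[nam]) is just 1, and df starts with zero rows. The assignment therefore tries to put a whole column into a single row. Building a fresh data frame per column name avoids this. The inline data below stands in for the CSV files, and the X1/X2/X3 column names are an assumption about what read.csv() produces.

```r
# Stand-ins for read.csv(R1), read.csv(R2), read.csv(R3) (assumed column names)
df1 <- data.frame(Class = c("A", "B", "C"), X1 = c(5, 9, 7),  X2 = c(3, 1, 9), X3 = c(2, 4, 8))
df2 <- data.frame(Class = c("A", "B", "C"), X1 = c(7, 2, 4),  X2 = c(3, 6, 7), X3 = c(10, 2, 1))
df3 <- data.frame(Class = c("A", "A", "A"), X1 = c(5, 2, 12), X2 = c(4, 6, 3), X3 = c(1, 2, 8))

# One merged data frame per column name; [[nam]] extracts a plain vector
merged <- lapply(1:3, function(i) {
  nam <- paste0("X", i)
  data.frame(Class = df1$Class,
             B1_1  = df1[[nam]],
             B1_2  = df2[[nam]],
             B1_3  = df3[[nam]],
             stringsAsFactors = FALSE)
})
names(merged) <- paste0("X", 1:3)
```

merged$X1, merged$X2 and merged$X3 then hold the combined data for each column. This assumes all three files have the same number of rows.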
I want to take a row and use it to set the colnames. Example below:
df1a = data.frame(Customer = c("A", "a",1:8), Product = c("B", "b",11:18))
colnames(df1a)<-df1a[2,]
Expected output
a b
1 A B
2 a b
3 1 11
4 2 12
5 3 13
6 4 14
7 5 15
8 6 16
9 7 17
10 8 18
I think the problem is that df1a[2,] is a data frame.
Here the columns are of class factor, because stringsAsFactors = TRUE was the default in the data.frame call before R 4.0.0. So the values that end up as names are the factor's integer storage codes rather than the actual values. Create the data frame with stringsAsFactors = FALSE:
df1a <- data.frame(Customer = c("A", "a",1:8),
Product = c("B", "b",11:18), stringsAsFactors = FALSE)
and then do the assignment
names(df1a) <- unlist(df1a[2,])
Or, as @Ryan mentioned, unlist is not needed:
names(df1a) <- df1a[2,]
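To see the difference between a factor's codes and its labels, here is a small sketch. It requests factors explicitly so that it behaves the same way on any R version; the variable names codes and labels are only for illustration.

```r
# Explicitly request factors to reproduce the old default behaviour
df1a <- data.frame(Customer = c("A", "a", 1:8),
                   Product  = c("B", "b", 11:18),
                   stringsAsFactors = TRUE)

# Internal integer codes of row 2 (what leaks into the names by accident)
codes <- vapply(df1a[2, ], as.integer, integer(1))

# Labels of row 2 (what was intended)
labels <- vapply(df1a[2, ], as.character, character(1))

# Converting each cell with as.character works for factor and character columns alike
names(df1a) <- unname(labels)
```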
You can change the names without generating a new data.frame:
names(df1a) <- lapply(df1a[2,], as.character)
If I have 5 data frames in the global environment, such as a, b, c, d, and e,
I want data frame a to be compared with e, and if R finds any common elements in a and e, delete those elements from a. Then I want data frame b to be compared with e and the common elements deleted, and so on.
Actually, I have 20 tables that need to be compared with e.
Can anyone give some elegant way to handle this problem? I'm thinking of loop or functions but can't work the details out.
Thanks everybody and have a nice day!
The easiest would be to put all the dataframes you want to compare in a list, then use lapply to loop over this list:
# create list of data.frames
dlist <- list(df1 = data.frame(var1 = 1:10), df2 = data.frame(var1 = 11:20),
df3 = data.frame(var1 = 21:30), df4 = data.frame(var1 = 31:40))
# create master-data.frame
set.seed(1)
df <- data.frame(var1 = sample(1:100, 30))
# use lapply() to loop over the data and exclude all elements that are in the master-data.frame
dlist <- lapply(dlist, function(x){
x <- x[!x$var1 %in% df$var1, , drop = FALSE]
})
Result:
> dlist
$df1
var1
2 2
3 3
4 4
5 5
7 7
8 8
9 9
$df2
var1
1 11
2 12
3 13
4 14
5 15
8 18
$df3
var1
2 22
3 23
4 24
6 26
10 30
$df4
var1
1 31
3 33
5 35
6 36
8 38
9 39
10 40
If you absolutely need the data frames in your global environment, you could use list2env:
list2env(dlist, envir = .GlobalEnv)
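The filtering step can also be pulled out into a small named function, which may be easier to reuse across 20 tables. This is only a sketch: drop_common and its col argument are hypothetical names, and it assumes the comparison is on a single shared column.

```r
# Hypothetical helper: drop rows of `x` whose `col` value also occurs in `master`
drop_common <- function(x, master, col = "var1") {
  x[!x[[col]] %in% master[[col]], , drop = FALSE]
}

# Small made-up tables to compare against the master table `e`
tables <- list(a = data.frame(var1 = 1:5),
               b = data.frame(var1 = 4:8))
e <- data.frame(var1 = c(2, 5, 7))

cleaned <- lapply(tables, drop_common, master = e)
```

If the tables already exist as separate objects, mget(c("a", "b")) collects them into a named list first.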
I have data frames like the following that I need to reformat into a single row, so that I can create a new data frame that's a collection of many of the simpler data frames, with one row in the new data frame representing all of the data of one of the simpler original data frames.
Here's a trivial example of the format of the original data frames:
> myDf = data.frame(Seconds=seq(0,1,.25), s1=seq(0,8,2), s2=seq(1,9,2))
>
> myDf
Seconds s1 s2
1 0.00 0 1
2 0.25 2 3
3 0.50 4 5
4 0.75 6 7
5 1.00 8 9
And below is what I want it to look like after being reformatted. Each column indicates rXsY, where "rX" indicates the row number of the original data frame, and "sY" indicates the "s1" or "s2" column of the original data frame. The "Seconds" column is omitted in the new data frame, as its information is implicit in the row number.
> myNewDf
r1s1 r1s2 r2s1 r2s2 r3s1 r3s2 r4s1 r4s2 r5s1 r5s2
1 0 1 2 3 4 5 6 7 8 9
I suspect this is really simple and probably involves some combination of reshape(), melt(), and/or cast(), but the proper incantations are escaping me. I could post what I've tried, but I think it would just distract from what's probably a simple question? If anyone would like me to do so, just ask in the comments.
The ideal solution would also somehow programmatically generate the new column names based on the original data frame's column names, since the column names won't always be the same. Also, if it's not difficult, can I somehow simultaneously do this same operation to a list of similar data frames (all the same number of rows, all the same column names, but with differing values in their s1 & s2 columns)? Ultimately I need a single data frame that contains the data from multiple simpler data frames, like this...
> myCombinedNewDf # data combined from 4 separate original data frames
r1s1 r1s2 r2s1 r2s2 r3s1 r3s2 r4s1 r4s2 r5s1 r5s2
1 0 1 2 3 4 5 6 7 8 9
2 10 11 12 13 14 15 16 17 18 19
3 20 21 22 23 24 25 26 27 28 29
4 30 31 32 33 34 35 36 37 38 39
Using melt() from reshape2, you can do it like this:
library(reshape2)
# Melt the data, omitting `Seconds`
df.melted <- melt(myDf[, -1], id.vars = NULL)
# Transpose the values into a single row
myNewDF <- t(df.melted[, 2])
# Assign new variable names
colnames(myNewDF) <- paste0("r", rownames(myDf), df.melted[, 1])
# r1s1 r2s1 r3s1 r4s1 r5s1 r1s2 r2s2 r3s2 r4s2 r5s2
# 1 0 2 4 6 8 1 3 5 7 9
This melts the data frame, uses the first column (the variable names from the original dataset) to construct the variable names for the new dataset, and uses the transpose of the second column (the data values) as the row of data.
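For readers without reshape2, roughly the same long format can be sketched in base R with stack(). This is a swapped-in alternative rather than the method above, and the names long and row are only for illustration.

```r
myDf <- data.frame(Seconds = seq(0, 1, .25), s1 = seq(0, 8, 2), s2 = seq(1, 9, 2))

# stack() returns two columns: `values` (the data) and `ind` (the source column name)
long <- stack(myDf[-1])

# Transpose the values into a one-row matrix
row <- t(long$values)

# Row numbers are recycled across the stacked columns: r1s1 ... r5s1, r1s2 ... r5s2
colnames(row) <- paste0("r", seq_len(nrow(myDf)), long$ind)
```

As with the melt version, the columns come out grouped by original column (all s1 values first), not interleaved row by row.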
If you want an automated approach to combining your datasets, you can take this a step further:
# Another data frame
myOtherDF <- data.frame(Seconds = seq(0, 1, 0.25),
s1 = seq(1, 9, 2),
s2 = seq(0, 8, 2))
# Turn the above steps into a function
colToRow <- function(x) {
melted <- melt(x[, -1], id.vars = NULL)
row <- t(melted[, 2])
colnames(row) <- paste0("r", rownames(x), melted[, 1])
row
}
# Create a list of the data frames to process
myDFList <- list(myDf, myOtherDF)
# Apply our function to each data frame in the list and append
myNewDF <- data.frame(do.call(rbind, lapply(myDFList, colToRow)))
# r1s1 r2s1 r3s1 r4s1 r5s1 r1s2 r2s2 r3s2 r4s2 r5s2
# 1 0 2 4 6 8 1 3 5 7 9
# 2 1 3 5 7 9 0 2 4 6 8
The relevant values can be extracted row-wise by using c(t(therelevantdata)).
In other words:
Values <- c(t(myDf[-1]))
If names are important at this point, you can do:
Names <- sprintf("r%ss%s", rep(1:5, each = 2), 1:2)
You can get a named vector with:
setNames(Values, Names)
# r1s1 r1s2 r2s1 r2s2 r3s1 r3s2 r4s1 r4s2 r5s1 r5s2
# 0 1 2 3 4 5 6 7 8 9
Or a named single-row data.frame with:
setNames(data.frame(t(Values)), Names)
# r1s1 r1s2 r2s1 r2s2 r3s1 r3s2 r4s1 r4s2 r5s1 r5s2
# 1 0 1 2 3 4 5 6 7 8 9
If you have a list of your data.frames, as shared in @cyro111's answer, you can easily do the following:
do.call(rbind, lapply(myDfList, function(x) c(t(x[-1]))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 0 1 2 3 4 5 6 7 8 9
# [2,] 10 11 12 13 14 15 16 17 18 19
Convert to a data.frame with as.data.frame and add the names with either names<- or setNames.
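That last step might look like the sketch below, which also derives the names from the original column names instead of hard-coding s1 and s2. It assumes every data frame in the list has the same shape as the first one; first and combined are illustrative names.

```r
myDfList <- list(
  data.frame(Seconds = seq(0, 1, .25), s1 = seq(0, 8, 2),   s2 = seq(1, 9, 2)),
  data.frame(Seconds = seq(0, 1, .25), s1 = seq(10, 18, 2), s2 = seq(11, 19, 2))
)

# One row per data frame, values taken row-wise and the first column dropped
mat <- do.call(rbind, lapply(myDfList, function(x) c(t(x[-1]))))

# Build "r<row><column name>" labels from the first data frame
first <- myDfList[[1]]
Names <- paste0("r",
                rep(seq_len(nrow(first)), each = ncol(first) - 1),
                names(first)[-1])

combined <- setNames(as.data.frame(mat), Names)
```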
Generalized as a function:
myFun <- function(indf, asVec = TRUE) {
values <- c(t(indf[-1]))
Names <- sprintf("r%ss%s", rep(1:nrow(indf), each = ncol(indf[-1])),
1:ncol(indf[-1]))
out <- setNames(values, Names)
if (isTRUE(asVec)) out
else (as.data.frame(as.matrix(t(out))))
}
Try it out:
myFun(myDf) # Vector
myFun(myDf, FALSE) # data.frame
It's even more convenient on a list of data.frames.... lots of options :-)
dfList1 <- list(
data.frame(s = 1:2, a1 = 1:2, a2 = 3:4, a3 = 5:6),
data.frame(s = 1:2, a1 = 11:12, a2 = 31:32, a3 = 51:52)
)
lapply(dfList1, myFun)
do.call(rbind, lapply(dfList1, myFun))
t(sapply(dfList1, myFun))
as.data.frame(do.call(rbind, lapply(dfList1, myFun)))
You can try dcast from the devel version of data.table (v1.9.5 or later), which can take multiple value.var columns. Create two columns, one with the row number ('rn1') and the second a grouping variable ('grp'), and use dcast.
library(data.table)#v1.9.5+
dcast(setDT(myDf[-1])[, c('rn1', 'grp') := list(paste0('r', 1:.N), 1)],
grp~rn1, value.var=c('s1', 's2'))
# grp r1_s1 r2_s1 r3_s1 r4_s1 r5_s1 r1_s2 r2_s2 r3_s2 r4_s2 r5_s2
#1: 1 0 2 4 6 8 1 3 5 7 9
Or we can use reshape from base R
reshape(transform(myDf, rn1=paste0('r', 1:nrow(myDf)), grp=1)[-1],
idvar='grp', timevar='rn1', direction='wide')
# grp s1.r1 s2.r1 s1.r2 s2.r2 s1.r3 s2.r3 s1.r4 s2.r4 s1.r5 s2.r5
#1 1 0 1 2 3 4 5 6 7 8 9
Update
If we have several data frames, we can place them in a list and then either use lapply with dcast, or rbind the datasets in the list with rbindlist, specifying a grouping variable for each dataset, and then apply dcast to the whole dataset.
Using 'myOtherDF' from @Alex A.'s post:
myDFList <- list(myDf, myOtherDF)
dcast(rbindlist(Map(cbind, myDFList, gr=seq_along(myDFList)))[,-1,
with=FALSE][, rn1:= paste0('r', 1:.N), by=gr],
gr~rn1, value.var=c('s1', 's2'))
# gr r1_s1 r2_s1 r3_s1 r4_s1 r5_s1 r1_s2 r2_s2 r3_s2 r4_s2 r5_s2
#1: 1 0 2 4 6 8 1 3 5 7 9
#2: 2 1 3 5 7 9 0 2 4 6 8
base R solution
#prepare data
myDf1 = data.frame(Seconds=seq(0,1,.25), s1=seq(0,8,2), s2=seq(1,9,2))
myDf2 = data.frame(Seconds=seq(0,1,.25), s1=seq(10,18,2), s2=seq(11,19,2))
myDfList=list(myDf1,myDf2)
#allocate memory
myCombinedNewDf=data.frame(matrix(NA_integer_,nrow=length(myDfList),ncol=(ncol(myDf1)-1)*nrow(myDf1)))
#reformat
for (idx in seq_along(myDfList)) myCombinedNewDf[idx,]=c(t(myDfList[[idx]][,-1]))
#set colnames
colnames(myCombinedNewDf)=paste0("r",sort(rep.int(1:nrow(myDf1),2)),colnames(myDf1)[-1])
As requested, here is an extended version that handles a separate factor column:
#allocate memory
#the first column should ultimately be a factor
#I would use a character column first and later change it to type factor
#note the stringsAsFactors option!
myCombinedNewDf=data.frame(rep(NA_character_,length(myDfList)),
matrix(NA_integer_,
nrow=length(myDfList),
ncol=(ncol(myDf1)-1)*nrow(myDf1)),
stringsAsFactors=FALSE)
#reformat
for (idx in seq_along(myDfList)) {
myCombinedNewDf[idx,-1]=c(t(myDfList[[idx]][,-1]))
#I have just made up some criterion to get one "yes" and one "no"
#"yes" if the sum of all values is below 100, "no" otherwise
myCombinedNewDf[idx,1]=if (sum(myDfList[[idx]][,-1])<100) "yes" else "no"
}
#set colnames
colnames(myCombinedNewDf)=c("flag",
paste0("r",
sort(rep.int(1:nrow(myDf1),2)),
colnames(myDf1)[-1])
)
myCombinedNewDf$flag=factor(myCombinedNewDf$flag)
myCombinedNewDf