Assume a data.frame:
df <- data.frame(name = c("a","b","c","d","e"),rank = c(1,1,4,3,2))
name rank
a 1
b 1
c 4
d 3
e 2
Based on the above data.frame, I want to create a new one that holds the count of transitions from one rank to another. So the output would be something like this:
name 1to1 1to2 1to3 1to4 2to1 2to2 2to3 2to4 3to1 3to2 3to3 3to4 4to1 4to2 4to3 4to4
1 b 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 c NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA
3 d NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
4 e NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA
One way to do this would be to use a for loop with if statements, but I am pretty sure there is a more efficient way.
For example, if item d has a rank of 3 and item c is ranked as 4 then the code should increase the count of the 4to3 column under d's row (as per example above). Please let me know if this is unclear and I appreciate all the help.
P.S. colnames are not that important.
You could use Map to create sequences for extracting the transitions and collapse them into the desired form using paste.
tmp <- sapply(Map(seq, 1:(nrow(df1)-1), 2:nrow(df1)), function(i) df1$rank[i])
v <- apply(tmp, 2, function(x) paste(x, collapse="to"))
Then create a grid with all permutations
to <- apply(expand.grid(1:4, 1:4), 1, function(x) paste(x, collapse="to"))
and compare them with the actual transitions to get the resulting binary structure; create a data frame out of it.
res <- data.frame(name=df1$name[-1], t(sapply(v, function(i) setNames(+(i == to), to))))
Afterwards, you may convert the zeroes to NA using
res[res == 0] <- NA
Result
res
# name X1to1 X2to1 X3to1 X4to1 X1to2 X2to2 X3to2 X4to2 X1to3 X2to3 X3to3 X4to3 X1to4 X2to4 X3to4 X4to4
# 1to1 b 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 1to4 c NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA
# 4to3 d NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA
# 3to2 e NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA
Data
df1 <- structure(list(name = structure(1:5, .Label = c("a", "b", "c",
"d", "e"), class = "factor"), rank = c(1, 1, 4, 3, 2)), class = "data.frame", row.names = c(NA,
-5L))
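As a possible alternative, the transitions can be built by pairing each rank with its successor and filling a matrix through index pairs. This is only a sketch: it assumes the ranks run from 1 to 4 as in the question, and the names from, to, labels, all_labels and m are made up for illustration.

```r
# Same data as in the question
df1 <- data.frame(name = c("a", "b", "c", "d", "e"), rank = c(1, 1, 4, 3, 2))

# Pair every rank with the rank that follows it
from <- head(df1$rank, -1)
to   <- tail(df1$rank, -1)
labels <- paste0(from, "to", to)

# All possible transitions, ordered 1to1, 1to2, ..., 4to4 (assumes ranks 1..4)
all_labels <- paste0(rep(1:4, each = 4), "to", rep(1:4, times = 4))

# Start from an all-NA matrix and mark one cell per observed transition
m <- matrix(NA_integer_, nrow = length(labels), ncol = length(all_labels),
            dimnames = list(NULL, all_labels))
m[cbind(seq_along(labels), match(labels, all_labels))] <- 1L

# check.names = FALSE keeps column names such as "1to1" unchanged
res <- data.frame(name = df1$name[-1], m, check.names = FALSE)
res
```

Indexing with a two-column matrix sets exactly one cell per row, so no comparison against the full grid is needed.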
I have created a data frame that contains 3 sites, and I have written a nested for loop to create my desired matrices for one of them. The overall objective is to find a more efficient way to do this for each of the 3 sites instead of just the one.
The outputs from the nested for loop (EDmatrix and timeLags) show the form of the result expected for the other two sites. I would like a more efficient way of obtaining these matrices that also works for all sites instead of just the one in this example.
set.seed(123)
d1 = sample.int(50, 27)
d2 = sample.int(50, 27)
d3 = sample.int(50, 27)
year <- c(1990:1998)
site <- c(rep("a", 9), rep("b", 9), rep("c", 9))
ED = function(x,y){
#x and y are vectors of spp abundances
#they must be the same length!
if(length(x)!=length(y)) stop("Bad abundances!")
out = sqrt(sum((x-y)^2))
out
}
df <- data.frame(site, year, d1 = d1, d2 = d2, d3 = d3)
Here is the code to get the expected output for only a single site, but I would like to be able to do this for all of the sites in the data frame df.
subdf = subset(df,site=="a") # subset data for one site
EDmatrix = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the dissimilarity values
timeLags = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the time lags
# First loop through all "j" years from 1 to the total number of years
for(j in 1: length(subdf$year)){
# Now loop through all "k" years from 1 to the total number of years
for(k in 1: length(subdf$year)){
# grab density data for year "j"
jdensity <- subdf[j,-c(1:2)]
# grab density data for year "k"
kdensity <- subdf[k,-c(1:2)]
# calculate and store (in the EDmatrix) the ED value based on the data for year j and k
EDmatrix[j,k] <- ED(jdensity, kdensity)
# calculate and store (in timeLags) the time lag (the absolute value of the difference
# in time between year j and k)
timeLags[j,k] <- abs(subdf[j, 2] - subdf[k, 2])
}# exit k loop
}# exit j loop
EDmatrix[lower.tri(EDmatrix, diag=T)]=NA # set duplicate entries to NA
timeLags[lower.tri(timeLags, diag=T)]=NA # set duplicate entries to NA
y = as.vector(EDmatrix) # turn the matrix into a vector
x = as.vector(timeLags)
We may use outer for this operation
library(dplyr)
library(tidyr)
library(purrr)
f1 <- function(dat, i, j) {
subdat <- dat %>%
select(starts_with('d'))
jdensity <- subdat[i, ]
kdensity <- subdat[j,]
EDtmp <- ED(jdensity, kdensity)
timetmp <- abs(dat$year[i] - dat$year[j])
tibble(EDtmp, timetmp)
}
f2 <- function(dat, s1, s2) {
mat <- outer(s1, s2, Vectorize(\(i, j) list(f1(dat, i, j))))
EDmatrix <- matrix(map_dbl(mat, ~ .x$EDtmp), length(s1), length(s1))
timeLags <- matrix(map_dbl(mat, ~ .x$timetmp), length(s1), length(s1))
EDmatrix[lower.tri(EDmatrix, diag=TRUE)]=NA
timeLags[lower.tri(timeLags, diag=TRUE)]=NA
y = as.vector(EDmatrix)
x = as.vector(timeLags)
tibble(y, x)
}
out1 <- df %>%
group_by(site) %>%
summarise(out = f2(cur_data(), row_number(), row_number()),
.groups = 'drop') %>%
unnest(out)
Checking against the OP's output:
> out1$x[out1$site == "a"]
[1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> x
[1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> out1$y[out1$site == "a"]
[1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
> y
[1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
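A base R sketch of the same idea is shown here for comparison. It assumes that ED is plain Euclidean distance, so stats::dist (whose default method is "euclidean") can stand in for it, and that the density columns are named d1, d2 and d3 as in the question. The helper name lag_pairs is made up for illustration.

```r
set.seed(123)
d1 <- sample.int(50, 27)
d2 <- sample.int(50, 27)
d3 <- sample.int(50, 27)
df <- data.frame(site = rep(c("a", "b", "c"), each = 9),
                 year = rep(1990:1998, times = 3),
                 d1 = d1, d2 = d2, d3 = d3)

# Hypothetical helper: pairwise distances and time lags for one site
lag_pairs <- function(sub) {
  ed  <- as.matrix(dist(sub[, c("d1", "d2", "d3")]))  # Euclidean distance between years
  lag <- as.matrix(dist(sub$year))                    # absolute difference in years
  ed[lower.tri(ed, diag = TRUE)]   <- NA              # drop duplicates and the diagonal
  lag[lower.tri(lag, diag = TRUE)] <- NA
  data.frame(site = sub$site[1], y = as.vector(ed), x = as.vector(lag))
}

# Apply the helper to every site and stack the results
out <- do.call(rbind, lapply(split(df, df$site), lag_pairs))
head(out[!is.na(out$x), ])
```

Because dist works on all rows at once, the double loop over years disappears and split plus lapply handles the grouping by site.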
I have been facing an error while reading a csv file. The first few lines of the file are given below:
"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA
Reading the csv file:
test <- read.csv(file='/home/userxyz/test.csv')
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying the presence of the header.
test <- read.csv(file='/home/userxyz/test.file', header=T)
head(test)
# X X1.CEL X2.CEL X3.CEL X4.CEL
#1 1_s_at NA NA NA NA
#2 2_at NA NA NA NA
#3 3_at NA NA NA NA
#4 4_at NA NA NA NA
#5 5_g_at NA NA NA NA
#6 6_at NA NA NA NA
Explicitly specifying row.names did not work:
test <- read.csv(file='/home/userxyz/test.file', row.names=T)
#Error in read.table(file = file, header = header, sep = sep, quote = quote, :
# invalid 'row.names' specification
I have also tried the read.table and read.delim functions.
Is the error because of special characters in the row.names?
I think you are trying to read in the first column as row name. Try:
x <- '"","1.CEL","2.CEL","3.CEL","4.CEL"
"1_s_at",NA,NA,NA,NA
"2_at",NA,NA,NA,NA
"3_at",NA,NA,NA,NA
"4_at",NA,NA,NA,NA
"5_g_at",NA,NA,NA,NA
"6_at",NA,NA,NA,NA
"7_at",NA,NA,NA,NA'
read.csv(text = x, row.names = 1L)
# X1.CEL X2.CEL X3.CEL X4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
If you want to preserve exactly the header, do
read.csv(text = x, row.names = 1L, check.names = FALSE)
# 1.CEL 2.CEL 3.CEL 4.CEL
#1_s_at NA NA NA NA
#2_at NA NA NA NA
#3_at NA NA NA NA
#4_at NA NA NA NA
#5_g_at NA NA NA NA
#6_at NA NA NA NA
#7_at NA NA NA NA
Regarding row.names, see ?read.csv:
row.names: a vector of row names. This can be a vector giving the
actual row names, or a single number giving the column of the
table which contains the row names, or character string
giving the name of the table column containing the row names.
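To see the difference on an actual file rather than on text, here is a small sketch that writes a shortened version of the data to a temporary file. The file contents and the names csv_text and path are made up for illustration.

```r
# Write a shortened copy of the data to a temporary file
csv_text <- c('"","1.CEL","2.CEL"',
              '"1_s_at",NA,NA',
              '"2_at",NA,NA')
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# row.names = TRUE is neither a column number nor a column name, so it fails
bad <- try(read.csv(path, row.names = TRUE), silent = TRUE)
inherits(bad, "try-error")

# A single number selects the column that holds the row names
test <- read.csv(path, row.names = 1, check.names = FALSE)
rownames(test)
colnames(test)

unlink(path)
```

So the error comes from the value passed to row.names, not from special characters in the names themselves.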
I'm trying to create a vector that contains all of the non-NA numeric values in my data frame.
Basically, I want a vector that has (2,2,5,2,2,3,2,3,2,2,2,2,2).
two three four five six seven
2 NA NA NA NA NA
2 NA NA NA NA NA
NA NA NA 5 NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
NA 3 NA NA NA NA
2 NA NA NA NA NA
NA 3 NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
2 NA NA NA NA NA
Just subset the dataframe for non-NA values with !is.na(df):
df <- data.frame(two = c(2, 2, NA),
three = c(NA, NA, NA),
four = c(NA, 3, NA))
df
# two three four
# 1 2 NA NA
# 2 2 NA 3
# 3 NA NA NA
is.na(df)
# two three four
# [1,] FALSE TRUE TRUE
# [2,] FALSE TRUE FALSE
# [3,] TRUE TRUE TRUE
df[!is.na(df)]
# [1] 2 2 3
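Note that this indexing walks the data frame column by column. If the values are needed row by row, as in the vector given in the question, one option is to transpose first. The sketch below rebuilds the question's data by hand; the names m and by_row are made up for illustration.

```r
# The data frame from the question, rebuilt by hand
df <- data.frame(two   = c(2, 2, NA, 2, 2, NA, 2, NA, 2, 2, 2, 2, 2),
                 three = c(rep(NA, 5), 3, NA, 3, rep(NA, 5)),
                 four  = NA_real_,
                 five  = c(NA, NA, 5, rep(NA, 10)),
                 six   = NA_real_,
                 seven = NA_real_)

# Column-by-column order: all values of "two" first, then "three", then "five"
df[!is.na(df)]

# Row-by-row order: transpose, then drop the NA values
m <- t(as.matrix(df))
by_row <- m[!is.na(m)]
by_row
# [1] 2 2 5 2 2 3 2 3 2 2 2 2 2
```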
I have a function that takes as input a dataframe with certain columns
columns =['a', 'b',...,'z']
Now I have a dataframe DF with only a few of these columns: DF_columns = ['f', 'u', 'z']
How can I create a dataframe that has all of the columns, filled with NA where a column is not in DF, and that coincides with DF on the columns ['f', 'u', 'z']?
Example:
d = data.frame('g'=c(1,2,3), 's' = c(4,2,3))
columns = letters[1:21]
columns
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u"
> d
g s
1 1 4
2 2 2
3 3 3
x.or.na <- function(x, df) if (x %in% names(df)) df[[x]] else NA
as.data.frame(Map(x.or.na, columns, list(d)))
set.seed(42)
DF <- setNames(as.data.frame(matrix(sample(1:15, 15, replace=TRUE), ncol=3)), c('f', 'u', 'z') )
DF
# f u z
#1 14 8 7
#2 15 12 11
#3 5 3 15
#4 13 10 4
#5 10 11 7
res <- do.call(`data.frame`,lapply(split(letters[4:26], letters[4:26]),
function(x){x1 <- match(x, colnames(DF)); if(!is.na(x1)) DF[,x1] else NA}))
res
# d e f g h i j k l m n o p q r s t u v w x y z
#1 NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8 NA NA NA NA 7
#2 NA NA 15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 12 NA NA NA NA 11
#3 NA NA 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 NA NA NA NA 15
#4 NA NA 13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10 NA NA NA NA 4
#5 NA NA 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11 NA NA NA NA 7
Using dplyr
library(dplyr)
DF %>%
do({x1 <-data.frame(., setNames(as.list(rep(NA, sum(!letters[4:26] %in% names(DF)))),
setdiff(letters[4:26], names(DF))))
x1[,order(colnames(x1))] })
# d e f g h i j k l m n o p q r s t u v w x y z
#1 NA NA 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 8 NA NA NA NA 7
#2 NA NA 15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 12 NA NA NA NA 11
#3 NA NA 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 3 NA NA NA NA 15
#4 NA NA 13 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 10 NA NA NA NA 4
#5 NA NA 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 11 NA NA NA NA 7
This is quite easy (in terms of syntax) and efficient (in terms of speed) using the data.table package:
require(data.table) ## 1.9.2+
setDT(d)[, setdiff(columns, names(d)) := NA] ## (1)
setcolorder(d, columns) ## (2)
setDF(d) ## (3)
setDT converts d to a data.table, after which we use the := operator to create new columns by reference. There are many ways to use :=, but highlighted here is the use case LHS := RHS. Here LHS is a vector of column names and RHS is the value. NA is provided only once on the RHS, which gets automatically recycled for all other columns. Note that NA by default is logical type in R.
If required you can reorder the columns of d in the same order as columns using setcolorder.
Again, if necessary, you can convert the data.table back to a data.frame using the function setDF, which also modifies the object by reference. At the time of writing it was available only in the development version v1.9.3; later releases include it.
Here are a few methods and their timings.
createDF1 <- function(colVec, data)
{
m <- matrix(, nrow = nrow(data), ncol = length(colVec),
dimnames = list(NULL, colVec))
m[, names(data)] <- as.matrix(data)
data.frame(apply(m, 2, as.numeric))
}
createDF2 <- function(colVec, data)
{
rr <- setNames(rep(list(rep(NA_integer_, nrow(data))), length(colVec)),
nm = colVec)
rr[match(names(data), colVec)] <- data
as.data.frame(rr)
}
createDF3 <- function(colVec, data)
{
rr <- setNames(replicate(length(colVec),
list(rep(NA_integer_, nrow(data)))),
nm = colVec)
rr[match(names(data), colVec)] <- data
as.data.frame(rr)
}
Create a 3,000,000 x 3 data frame to test on:
columns <- letters[1:21]
d <- data.frame(g = 1:3e6L, s = 1:3e6L, j = 1:3e6L)
Run some tests:
system.time({ createDF1(columns, d) })
# user system elapsed
# 5.022 1.023 6.054
system.time({ createDF2(columns, d) })
# user system elapsed
# 0.007 0.004 0.011
system.time({ createDF3(columns, d) })
# user system elapsed
# 0.105 0.077 0.183
Of these three, it looks like rep(list(rep(NA_integer_, nrow(data))), length(columns)) is the way to go, and replace values from that.
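For reference, here is the list-based approach on a small data frame where the result is easy to inspect. The function name fill_columns and the guard on unknown column names are additions for illustration, not part of the timed versions above.

```r
# Hypothetical wrapper around the rep(list(...)) approach
fill_columns <- function(col_vec, data) {
  # Assumption: every column of data must appear in col_vec
  stopifnot(all(names(data) %in% col_vec))
  # One all-NA column per requested name
  out <- setNames(rep(list(rep(NA_integer_, nrow(data))), length(col_vec)),
                  col_vec)
  # Overwrite the columns that exist in data
  out[match(names(data), col_vec)] <- data
  as.data.frame(out)
}

d_small <- data.frame(g = 1:3, s = c(4L, 2L, 3L))
res_small <- fill_columns(letters[1:21], d_small)
res_small[, c("f", "g", "s")]
```

The list of NA vectors is cheap to build because rep only repeats references to the same vector until a column is replaced.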
Setup:
set.seed(1)
DF_all <- setNames(data.frame(matrix(rnorm(5*26), nrow=5, ncol=26)), letters)
DF <- DF_all[, c('f','u','z')]
Create a new empty dataframe and populate with your columns:
DF2 <- setNames(data.frame(matrix(nrow=5, ncol=26)), letters)
DF2[, c('f','u','z')] <- DF[, c('f','u','z')]
Result:
> DF2
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 NA NA NA NA NA -0.05612874 NA NA NA NA NA NA NA NA NA NA NA NA NA NA -0.62036668 NA NA NA NA 0.71266631
2 NA NA NA NA NA -0.15579551 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0.04211587 NA NA NA NA -0.07356440
3 NA NA NA NA NA -1.47075238 NA NA NA NA NA NA NA NA NA NA NA NA NA NA -0.91092165 NA NA NA NA -0.03763417
4 NA NA NA NA NA -0.47815006 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0.15802877 NA NA NA NA -0.68166048
5 NA NA NA NA NA 0.41794156 NA NA NA NA NA NA NA NA NA NA NA NA NA NA -0.65458464 NA NA NA NA -0.32427027
[<- could be used to fill up the missing columns with NA.
`[<-`(d,, setdiff(columns, names(d)), NA)[columns]
#`[<-`(d,, columns[!columns %in% names(d)], NA)[columns] #Alternative
# a b c d e f g h i j k l m n o p q r s t u
#1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA 4 NA NA
#2 NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA 2 NA NA
#3 NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA 3 NA NA
Or directly adding the missing columns to the original data.frame
d[columns[!columns %in% names(d)]] <- NA
d[columns]
# a b c d e f g h i j k l m n o p q r s t u
#1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA 4 NA NA
#2 NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA 2 NA NA
#3 NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA 3 NA NA
Or in a function:
f <- function(DF, COL) {
DF[COL[!COL %in% names(DF)]] <- NA
DF[COL]
}
f(d, columns)
# a b c d e f g h i j k l m n o p q r s t u
#1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA 4 NA NA
#2 NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA NA NA 2 NA NA
#3 NA NA NA NA NA NA 3 NA NA NA NA NA NA NA NA NA NA NA 3 NA NA
Data
d <- data.frame('g'=c(1,2,3), 's' = c(4,2,3))
columns <- letters[1:21]
I'm pretty frustrated because I don't know how to name the columns and rows in a list of data.frames. I want to avoid using a loop, so I figured I could just use lapply.
Ok at first I have the following list:
>a
$nem.greedyMAP.FALSE.POS
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA NA
$nem.greedyMAP.FALSE.NEG
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA NA
Of course this list is much bigger, otherwise it wouldn't be worth the trouble.
However, I want to give the columns and rows of all data.frames the same names.
So I thought I could use:
lapply(a, function(x) {colnames(x) <- paste("col",1:10,sep="")})
But nothing happens. How could I achieve this? Or is lapply the wrong way?
Thanks
I'd prefer setNames in this case
set.seed(1)
datalist <- list(dat1 = data.frame(A = 1:10, B = rnorm(10)),
dat2 = data.frame(C = 100:109, D = rnorm(10))
)
lapply(datalist, names)
## $dat1
## [1] "A" "B"
## $dat2
## [1] "C" "D"
datalist <- lapply(datalist, setNames, paste0("col", 1:2))
lapply(datalist, names)
## $dat1
## [1] "col1" "col2"
## $dat2
## [1] "col1" "col2"
EDIT
A more general solution to modify rownames and colnames within a list
datalist <- lapply(datalist, "colnames<-", paste0("col", 1:2))
datalist <- lapply(datalist, "rownames<-", letters[1:10])
You need to remember that the object x inside the lapply is not the original object, but a copy. Changing the colnames of the copy does not impact the original object. You need to return x in order to get a new copy of the object that includes the new names.
new_obj = lapply(a, function(x) {
colnames(x) <- paste("col",1:10,sep="")
return(x)
})
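A small sketch makes the difference visible. The list a below is a two-element stand-in for the real one, and the names wrong and right are made up for illustration.

```r
# Stand-in for the real list: two small all-NA data frames
a <- list(first  = data.frame(matrix(NA, 2, 3)),
          second = data.frame(matrix(NA, 2, 3)))

# Without returning x, each element is just the value of the assignment,
# i.e. the character vector of new names
wrong <- lapply(a, function(x) {
  colnames(x) <- paste0("col", 1:3)
})
str(wrong$first)

# Returning x gives back the renamed data frame
right <- lapply(a, function(x) {
  colnames(x) <- paste0("col", 1:3)
  rownames(x) <- paste0("row", 1:2)
  x
})
dimnames(right$first)

# The original list is untouched until the result is assigned back
names(a$first)
```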