This is driving me nuts. I am trying to sort a data frame by its first row in ascending order using the order function. Below is a minimal example:
values <- c(11,10,9,8,7,6,5,4,3,2,1)
labels <- c("A","B","C","D","E","F","G","H","I","J","K")
df <- data.frame(rbind(values,labels))
newdf <- df[,with(df,order(df[1,]))]
print(newdf)
I have also tried this with
newdf <- df[,order(df[1,])]
Here is the output I'm getting
X11 X2 X1 X10 X9 X8 X7 X6 X5 X4 X3
values 1 10 11 2 3 4 5 6 7 8 9
labels K B A J I H G F E D C
Which is clearly wrong! So what is going on here?
This is an odd way to structure your data in R, so it will cause headaches, but you can make it work. See @thelatemail's comment about columns vs. rows. To make this work in your case, do:
values <- c(11,10,9,8,7,6,5,4,3,2,1)
labels <- c("A","B","C","D","E","F","G","H","I","J","K")
df <- data.frame(rbind(values,labels), stringsAsFactors = FALSE)
newdf <- df[order(as.numeric(df["values",]))]
newdf
# X11 X10 X9 X8 X7 X6 X5 X4 X3 X2 X1
# values 1 2 3 4 5 6 7 8 9 10 11
# labels K J I H G F E D C B A
Note, in particular, stringsAsFactors = FALSE when you create the data frame.
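To see why the original attempt produced 1, 10, 11, 2, ..., here is a minimal sketch (the vector names are made up for illustration). rbind() of a numeric and a character vector yields a character matrix, so the values end up as text (or as factors built from text in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE), and text sorts character by character rather than by magnitude:

```r
# Hypothetical small vector standing in for the "values" row
values <- c(11, 10, 9, 2, 1)
as_text <- as.character(values)   # what rbind(values, labels) effectively does

# Text sorts lexicographically: "10" and "11" come before "2"
text_sorted <- sort(as_text, method = "radix")

# Converting back to numeric restores the intended ordering
numeric_sorted <- sort(as.numeric(as_text))

print(text_sorted)
print(numeric_sorted)
```

This is why the answer wraps the row in as.numeric() before calling order().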
Remember, data.frames are lists, and each element of the list is a vector (possibly a list, but typically an atomic vector, especially if constructed in a standard way) of the same length. The individual elements of the data frame are columns. Rows are just the nested elements with the same index value. This makes it much easier to work with a data frame like this:
df <- data.frame(values = values, labels = labels)
df[order(df$values),]
# values labels
# 11 1 K
# 10 2 J
# 9 3 I
# 8 4 H
# 7 5 G
# 6 6 F
# 5 7 E
# 4 8 D
# 3 9 C
# 2 10 B
# 1 11 A
Here you don't have to worry at all about whether your numbers are going to be coerced to characters and/or factors when you line them up with another vector that's character. In this example, whether or not labels was a factor had no impact on values.
Related
I want to save residuals from a linear model to a dataframe. I was trying to do it with the line of code (note that this was supposed to go inside a loop):
resi <- NULL
resi <- cbind(resi, colnames(dados[1])=residuals(m))
Here I intended to save the residuals vector from my model m under the same column name as in the dados object (which is basically a date), but I get the error:
Error: unexpected '=' in "resi <- cbind(resi, colnames(dados[1])="
You want `colnames<-`().
cbind(d, `colnames<-`(d, letters[1:4]))
# X1 X2 X3 X4 a b c d
# 1 1 4 7 10 1 4 7 10
# 2 2 5 8 11 2 5 8 11
# 3 3 6 9 12 3 6 9 12
It's similar to setNames() but also compatible with matrices.
Toy data:
d <- data.frame(matrix(1:12, 3, 4))
It is also possible to do this with tibble:
library(tibble)
tibble(resi, !!colnames(dados)[1] := residuals(m))
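The error in the question arises because the left-hand side of = inside a function call must be a literal name, not an expression. As a rough base R sketch of the loop the question describes, one option is to assign with [[ and a computed name. The data set (mtcars) and the predictor names below are stand-ins, not the asker's dados:

```r
# Stand-in data: regress mpg on each predictor in turn (assumption, not the original model)
predictors <- c("wt", "hp")

# Start with an empty data frame that already has the right number of rows
resi <- data.frame(row.names = rownames(mtcars))

for (p in predictors) {
  m <- lm(reformulate(p, response = "mpg"), data = mtcars)
  # [[<- accepts a computed column name, unlike cbind(name = value)
  resi[[p]] <- residuals(m)
}

print(head(resi))
```

Each pass through the loop adds one column named after the current predictor.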
I have a data frame that contains file numbers, and I would like to replace those numbers with file names that I have in another vector/matrix. How can I do it in R?
I am giving a reproducible matrix below:
> ash<-data.frame(matrix(c(4,2,NA,9,3,8,NA,NA,1,5,6,7),nrow=3, byrow=TRUE))
> ash2<-matrix(c("jegjgqe","hdd","odew","dhjs","ddj","hdiwhek","dij","jsosaeo"))
> ash
X1 X2 X3 X4
1 4 2 NA 9
2 3 8 NA NA
3 1 5 6 7
What I want is a matrix where the file number will be replaced by the names of ash2 matrix/vector. Like the value 1 would be "jegjgqe", value 2 would be "hdd" and so on. Is there any way to replace all values with those names at once?
We can index the vector/matrix directly (a matrix is just a vector with a dim attribute):
ash[] <- ash2[as.matrix(ash)]
Output:
ash
X1 X2 X3 X4
1 dhjs hdd <NA> <NA>
2 odew jsosaeo <NA> <NA>
3 jegjgqe ddj hdiwhek dij
Or use lapply to loop over the columns of the data.frame and replace the values based on the index to 'ash2' values
ash[] <- lapply(ash, function(x) ash2[x])
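A small sketch of the indexing trick on its own, with made-up names (lookup, idx). Here lookup is a plain vector, so the numeric matrix idx is used as a vector of positions in column-major order; NA positions give NA, and the result is a flat vector whose shape has to be restored (which is what assigning into ash[] does in the answer above):

```r
lookup <- c("first", "second", "third")          # hypothetical file names
idx <- matrix(c(2, NA, 3, 1, 1, 2), nrow = 2)    # hypothetical file numbers

# Indexing by a matrix of positions returns a flat character vector
flat <- lookup[idx]

# Restore the original shape by copying the dimensions
shaped <- flat
dim(shaped) <- dim(idx)

print(flat)
print(shaped)
```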
When I try to partially reorder columns using "[", the values are swapped but the column names do not move. See the example below:
x = data.frame(x1 = c(1,2,3), x2 = c(2,3,4), x3 = c("e","e","e"), x4 = c("f","f","f"))
x
#x1 x2 x3 x4
#1 2 e f
#2 3 e f
#3 4 e f
x[, c(3,4)] = x[, c(4,3)]
#x1 x2 x3 x4
#1 2 f e
#2 3 f e
#3 4 f e
Any idea why the column names are not moving, and how to solve this simply?
Try this
x <- x[,c(1,2,4,3)]
One option is cbind
x1 <- cbind(x[1:2], x[4:3])
x1
# x1 x2 x4 x3
#1 1 2 f e
#2 2 3 f e
#3 3 4 f e
Or we can index the columns in the desired numeric order, e.g. x[c(1, 2, 4, 3)].
The assignment in the question changes only the values, not the column names. The values are swapped by position, but the names do not follow them, because each column name stays attached to its position.
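A short sketch contrasting the two behaviours, using a copy of the data frame from the question (y and z are names introduced here for illustration):

```r
x <- data.frame(x1 = c(1, 2, 3), x2 = c(2, 3, 4), x3 = "e", x4 = "f")

# Assignment into positions 3 and 4: values are swapped, names stay put
y <- x
y[, c(3, 4)] <- y[, c(4, 3)]

# Selecting columns in a new order: names travel with their columns
z <- x[, c(1, 2, 4, 3)]

print(names(y))
print(names(z))
```

After the assignment, the column still called x3 holds the "f" values, which is the behaviour the question observed.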
I'm trying to explore a large dataset, both with data frames and with charts. I'd like to analyze the distribution of each variable by different metrics (e.g., sum(x), sum(x*y)) and for different sub-populations. I have 4 sub-populations, 2 metrics, and many variables.
In order to accomplish that, I've made a list structure such as this:
$variable1
...$metric1 <--- that's a df.
...$metric2
$variable2
...$metric1
...$metric2
Inside one of the data_frames (e.g., list$variable1$metric1), I've calculated distributions of the unique values for variable1 and for each of the four population groups (represented in columns). It looks like this:
$variable1$metric1
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.278 0.317 0.278 0.317
3 (3) 26-34 Years Old 0.225 0.228 0.225 0.228
4 (4) 35 or Older 0.497 0.456 0.497 0.456
$variable1$metric2
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.544 0.406 0.544 0.406
3 (3) 26-34 Years Old 0.197 0.310 0.197 0.310
4 (4) 35 or Older 0.259 0.284 0.259 0.284
What I'm trying to figure out is a good way to loop through the list of lists (probably melting the DFs in the process) and then output a ton of bar charts. In this case, the natural plot format would be, for each dataframe, a stacked bar chart with one stacked bar for each sub-population, grouping by the variable's unique values.
But I'm not familiar with iterated plotting, so I've hit a dead end. How might I plot from that list structure? Alternatively, is there a better structure in which I should be storing this information?
I find nested lists to be pretty tricky to work with, so I would combine them all into a single data frame that labels the name of the variable and the name of the metric:
lst <- list(alpha= list(a= data.frame(matrix(1:4, 2)), b= data.frame(matrix(6:9, 2))), beta = list(c = data.frame(matrix(11:14, 2))))
level1 <- lapply(lst, function(x) do.call(rbind, lapply(names(x), function(y) {x[[y]]$metric=y ; x[[y]]})))
dat <- do.call(rbind, lapply(names(level1), function(x) {level1[[x]]$variable=x ; level1[[x]]}))
dat
# X1 X2 metric variable
# 1 1 3 a alpha
# 2 2 4 a alpha
# 3 6 8 b alpha
# 4 7 9 b alpha
# 5 11 13 c beta
# 6 12 14 c beta
Now you can use standard tools for manipulating a single data frame to perform your data analysis.
Here's a start:
lst <- list(alpha= list(a= data.frame(matrix(1:4, 2)), b= data.frame(matrix(6:11, 2))),
beta = list(c = data.frame(matrix(11:14, 2))))
lst
$alpha
$alpha$a
X1 X2
1 1 3
2 2 4
$alpha$b
X1 X2 X3
1 6 8 10
2 7 9 11
$beta
$beta$c
X1 X2
1 11 13
2 12 14
#We can subset by number or by name
lst[['alpha']]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
lst[[1]]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
#The dollar sign naming convention reminds us that we are looking at a list.
#Let's sum the columns of both data frames in the alpha list
lapply(lst[['alpha']], colSums)
$a
X1 X2
3 7
$b
X1 X2 X3
13 17 21
Let's try to find the sum of each column of each data frame:
lapply(lst, colSums)
Error in FUN(X[[i]], ...) :
'x' must be an array of at least two dimensions
What happened? R is correctly refusing to run an array function on a list: each element of lst is itself a list of data frames, not a data frame. The function colSums needs a data frame, matrix, or other array with at least two dimensions, so we have to nest one lapply inside another. The logic can get complicated:
lapply(lst, function(x) lapply(x, colSums))
$alpha
$alpha$a
X1 X2
3 7
$alpha$b
X1 X2 X3
13 17 21
$beta
$beta$c
X1 X2
23 27
We can use rbind to put data.frames together:
rbind(lst$alpha$a, lst$beta$c)
X1 X2
1 1 3
2 2 4
3 11 13
4 12 14
Be sure not to do it the way you might be thinking (I've done it many times):
do.call(rbind, lst)
a b
alpha List,2 List,3
beta List,2 List,2
That isn't the result you're looking for. And make sure that the dimensions and column names are the same:
do.call(rbind, lst[[1]])
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
R is refusing to combine data frames that have 2 columns in one (alpha$a) and three columns in the other (alpha$b).
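I changed lst so that alpha$b has two columns like the others and combined them. (The definition of lst2 was not shown originally; the one below is reconstructed from the output that follows.)
lst2 <- lst
lst2$alpha$b <- data.frame(matrix(6:11, 3))
bind1 <- lapply(lst2, function(x) do.call(rbind, x))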
bind1
$alpha
X1 X2
a.1 1 3
a.2 2 4
b.1 6 9
b.2 7 10
b.3 8 11
$beta
X1 X2
c.1 11 13
c.2 12 14
That combines the elements of each list. Now I can combine the outer list to make one big data frame.
do.call(rbind, bind1)
X1 X2
alpha.a.1 1 3
alpha.a.2 2 4
alpha.b.1 6 9
alpha.b.2 7 10
alpha.b.3 8 11
beta.c.1 11 13
beta.c.2 12 14
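The combined row names (alpha.a.1 and so on) record where each row came from. One possible way to turn them into proper columns is sketched below; it assumes the list names themselves contain no dots, and the column names variable and metric are chosen here for illustration. The list is rebuilt inline, with alpha$b taken as matrix(6:11, 3) to match the output above:

```r
# Rebuild a list where every data frame has two columns (assumed definition)
lst2 <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                          b = data.frame(matrix(6:11, 3))),
             beta  = list(c = data.frame(matrix(11:14, 2))))

# Inner rbind, then outer rbind, as in the steps above
combined <- do.call(rbind, lapply(lst2, function(x) do.call(rbind, x)))

# Split "alpha.a.1" into its three parts; assumes no dots inside the names
parts <- do.call(rbind, strsplit(rownames(combined), ".", fixed = TRUE))
combined$variable <- parts[, 1]
combined$metric   <- parts[, 2]
rownames(combined) <- NULL

print(combined)
```

With the origin stored in columns, the data frame can be passed to grouping or faceting tools without parsing row names again.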
Here's a strategy based on melting a list (recursively):
lst = list(alpha= list(a= data.frame(matrix(1:4, 2)),
b= data.frame(matrix(6:11, 2))),
beta = list(c = data.frame(matrix(11:14, 2))))
library(reshape2)
m = melt(lst, id=1:2)
library(ggplot2)
ggplot(m, aes(X1,X2)) + geom_bar(stat="identity") + facet_grid(L1~L2)
Here is the sample dataframe.
I have a function which uses a for loop to go through a dataframe for a specified number of columns, remove NA values, remove duplicate values, then return the length of the final vector which has all the unique values present in the specified columns. The columns represent time, and the goal is to show how many total unique values have existed up until a certain point in time. Here's the sample matrix:
X1 X2 X3 X4 X5 X6
1 F F F F F F
2 C C C C C C
3 D D D D D D
4 A# A# A# A A A
5 <NA> <NA> <NA> G G <NA>
And here's the function:
uniquepitches <- function(file, col){
y <- read.csv(file, na.strings=c(""))
frame <- data.frame(y)
x <- c()
for(i in 1:col) {
noNAframe <- frame[!is.na(frame[, 1:i])]
x[i] <- length(unique(noNAframe))
}
x
}
The issue is that when I run it for any value for col, I get the wrong values. For example, uniquepitches("testnotes.csv", 1) gives me 5, which should be 4. uniquepitches("testnotes.csv", 6) gives me [1] 5 5 5 6 6 6, which should be [1] 4 4 4 6 6 6. So as of right now it looks like the x vector has one element too many in the first three run-throughs, which is why the length is one too many. How can I fix it so that it's the correct length?
This task can be accomplished with sapply():
df <- data.frame(X1=c('F','C','D','A#',NA), X2=c('F','C','D','A#',NA), X3=c('F','C','D','A#',NA), X4=c('F','C','D','A','G'), X5=c('F','C','D','A','G'), X6=c('F','C','D','A',NA) );
sapply(df, function(c) length(unique(c[!is.na(c)])) );
## X1 X2 X3 X4 X5 X6
## 4 4 4 5 5 4
Edit: @Molx might be correct, although the OP needs to clarify to be sure. If the requirement is indeed to process the cumulative column content, rather than each individual column in isolation, then you can do this:
sapply(1:ncol(df), function(c) length(unique(df[,1:c][!is.na(df[,1:c])])) );
## [1] 4 4 4 6 6 6
Edit: Sorry, I should've been clearer. The sapply() call replaces the entire for loop. So the function can be rewritten as follows:
uniquepitches <- function(file,col) {
frame <- read.csv(file,na.strings=c(""));
sapply(1:col, function(c) length(unique(frame[,1:c][!is.na(frame[,1:c])])) );
}
(Also notice that read.csv() returns a data.frame, so there's no need for manual coercion.)
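For comparison, here is a hedged variant of the cumulative count that works on a data frame already in memory instead of reading a file. The function name cumulative_unique is made up for this sketch. It uses drop = FALSE so that selecting a single column still returns a data frame, and unlist() to flatten the selected columns into one vector before counting:

```r
# Sample data matching the matrix shown in the question
df <- data.frame(X1 = c("F", "C", "D", "A#", NA), X2 = c("F", "C", "D", "A#", NA),
                 X3 = c("F", "C", "D", "A#", NA), X4 = c("F", "C", "D", "A", "G"),
                 X5 = c("F", "C", "D", "A", "G"), X6 = c("F", "C", "D", "A", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: unique non-NA values seen in columns 1..i, for i = 1..col
cumulative_unique <- function(frame, col) {
  vapply(seq_len(col), function(i) {
    vals <- unlist(frame[, seq_len(i), drop = FALSE], use.names = FALSE)
    length(unique(vals[!is.na(vals)]))
  }, integer(1))
}

counts <- cumulative_unique(df, 6)
print(counts)
```

vapply() with integer(1) is used instead of sapply() so the result is guaranteed to be an integer vector, even when col is 0.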