This is likely a very simple question, maybe even already answered, but I couldn't find it.
I have two data frames. For simplicity, I'll call them all and index. all has many columns, one of which I'll call match. index also has many (other) columns, including match. I want the rows from all whose match value also appears in the match column of index. Here is some dummy data:
all <- data.frame(match = sample(LETTERS, 20),
otherStuff = rnorm(20))
index <- data.frame(match = sample(LETTERS, 20),
moreStuff = rnorm(20))
I've tried the following:
all[all$match == index$match,]
But this works very badly:
> all[all$match == index$match,]
match otherStuff
3 C -0.6030772
Warning message:
In all$match == index$match :
longer object length is not a multiple of shorter object length
As you can see, it gives this warning and, it seems, only considers matches in the same position (in this case, C was the third element in both cases).
After that, I tried something similar with filter(), from dplyr, just to give it a chance. As I expected, it didn't work. I tried some other things (less logical, and frankly too desperate and nonsensical to describe here), all of which seemed crazy from the get-go. I really don't know what to do here...
Another solution is to use match(), which returns the matching rows of all in the order their keys appear in index:
Data:
set.seed(123)
all <- data.frame(match = sample(LETTERS, 10),
otherStuff = rnorm(10))
index <- data.frame(match = sample(LETTERS, 10),
moreStuff = rnorm(10))
Solution:
all[match(index$match, all$match, nomatch = 0),]
match otherStuff
10 Z -0.5558411
5 W -0.4456620
8 Q 0.4007715
6 A 1.2240818
7 K 0.3598138
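For comparison, here is a minimal sketch using `%in%`, which selects the same rows but keeps them in the order they have in `all`. The data frames are renamed `all_df` and `index_df` (names chosen here for illustration) so that the base function `all()` is not masked:

```r
set.seed(123)
all_df <- data.frame(match = sample(LETTERS, 10), otherStuff = rnorm(10))
index_df <- data.frame(match = sample(LETTERS, 10), moreStuff = rnorm(10))

# Rows of all_df whose key appears anywhere in index_df (order of all_df kept)
by_in <- all_df[all_df$match %in% index_df$match, ]

# Same rows via match(), but ordered as the keys appear in index_df
by_match <- all_df[match(index_df$match, all_df$match, nomatch = 0), ]
```

Both approaches assume the keys in `all_df` are unique; with duplicated keys, `match()` only finds the first occurrence, whereas `%in%` keeps every matching row.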
I'm working mainly off of this post to conditionally format some numbers I have in two columns. Here's a dataframe:
d <- data.frame(ID = c("a","b","c"),
total = c(145000000, 9.5, 867455),
se = c(9000000,0.84,120835),stringsAsFactors=FALSE)
I'm trying to apply the suffix "M" for numbers above 100,000 but leave smaller numbers alone for this formatting.
From that earlier post, I've tried to adapt the solution like so:
d[,2:3] <- ifelse(d[,2:3] > 100000, paste(round(d[,2:3] / 1e6, 1), trim = TRUE), "M")
But either nothing happens when I run the code (d remains unchanged) or I get this error: non-numeric argument to binary operator
For reference, I'd like the dataframe to look like this:
d <- data.frame(ID = c("a","b","c"),
total = c("1.45M", 9.5, "0.867M"),
se = c("9.0M",0.84,"0.12M"),stringsAsFactors=FALSE)
I don't mind if the results are character-formatted, as this is going to be output to a table in LaTeX and I won't be doing operations on these numbers.
Thanks!
This function will convert a numeric value to that format, then just use sapply to apply it to each column of the data.frame
milfun <- function(x) {ifelse(x>1e5,paste0(round(x/1e6,3),"M"),x)}
d$total <- sapply(d$total,milfun)
d$se <- sapply(d$se,milfun)
d
ID total se
1 a 145M 9M
2 b 9.5 0.84
3 c 0.867M 0.121M
Note that 145000000 is 145M, not 1.45M as in your desired output.
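Since ifelse() is already vectorised, the sapply() calls are not strictly needed. A minimal sketch, reusing the question's data and the same milfun logic, that formats both columns in one step:

```r
d <- data.frame(ID = c("a", "b", "c"),
                total = c(145000000, 9.5, 867455),
                se = c(9000000, 0.84, 120835),
                stringsAsFactors = FALSE)

# ifelse() works element-wise, so the function can take a whole column
milfun <- function(x) ifelse(x > 1e5, paste0(round(x / 1e6, 3), "M"), x)

# Apply to both numeric columns at once; the results are character vectors
d[2:3] <- lapply(d[2:3], milfun)
d
```

The columns end up as character, which the question says is acceptable for the LaTeX table.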
I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easy to do manually with a small number of columns, but with more than 5 columns it becomes lengthy and error-prone, especially when column names contain prefixes, suffixes or codes.
It would be a bonus if I could also control how many columns are multiplied together (in the example above I only considered 2 columns at a time, but it would be great to set that as a parameter, so that I could add an extra variable a*b*c if needed).
My initial idea was to use expand.grid() with column names and then somehow do a lookup to select the whole columns values for the product - but I hope there's an easier way to do it that I am not aware of.
You can use combn to create combination of column names taken 2 at a time and multiply them to create new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
# a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
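The question also asks about controlling how many columns are multiplied together. One possible sketch, assuming a helper called `col_products` (a hypothetical name, not part of any package), replaces `do.call` with `Reduce` so that it works for any combination size `m`:

```r
set.seed(1)
test <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Hypothetical helper: element-wise products of every combination of m columns
col_products <- function(df, m = 2, sep = "_") {
  # One product vector per combination; Reduce multiplies the columns pairwise
  prods <- combn(names(df), m, function(x) Reduce(`*`, df[x]), simplify = FALSE)
  # Name each result after the columns it was built from, e.g. "a_b"
  names(prods) <- combn(names(df), m, paste, collapse = sep)
  data.frame(prods, check.names = FALSE)
}

# Pairwise products plus the three-way product a*b*c
output <- cbind(col_products(test, 2), col_products(test, 3))
```

`check.names = FALSE` keeps the generated names untouched if a separator such as `-` is used instead of `_`.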
Could this also be a solution? Ronak's solution is more elegant, though!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
output <- test %>%
  mutate(a_b = a * b,  # element-wise product; prod(a, b) would collapse to a single number
         a_c = a * c,
         b_c = b * c
) %>%
select(-a,-b,-c)
I have come across a typical problem. I have been using a line of code in R as follows:
myfiles3a <- lapply(myfiles3, function(x) {
x$CHINA2 <- rowSums(x[,grep("China", names(x))], na.rm = T); x
})
It has worked flawlessly since I wrote it. But today, when I wanted to do the same thing for another country (Japan or Russia), the code gave an error: Error in rowSums(x[, grep("Russia", names(x))], na.rm = T) :
'x' must be an array of at least two dimensions
I am totally clueless why it's happening. My new line of code is as follows.
myfiles3c <- lapply(myfiles3, function(x) {x$RUSSIA2 <- rowSums(x[,grep("Russia", names(x))], na.rm = T); x})
I am unable to find where I have gone wrong in the two lines of code.
Judging by the error, the crux of the issue is the ?Extract behavior of [, which uses drop = TRUE by default. When we index with a comma (x[, j]) and only a single column is selected, the result is coerced from a data.frame to a vector. Consider the following example, where only a single column has 'Russia' in its name:
df1 <- data.frame(col1 = 1:5, col2 = 6:10)
df1$RussiaCol <- 1:5
rowSums(df1[,grep("Russia", names(df1))], na.rm = TRUE)
Error in rowSums(df1[, grep("Russia", names(df1))], na.rm = TRUE) :
'x' must be an array of at least two dimensions
Now, let's check the issue
df1[,grep("Russia", names(df1))]
#[1] 1 2 3 4 5
This returns a vector because of the default drop = TRUE. With drop = FALSE, the data.frame is kept:
df1[,grep("Russia", names(df1)), drop = FALSE]
# RussiaCol
#1 1
#2 2
#3 3
#4 4
#5 5
Or, without any comma, the index is treated as a column index by default:
df1[grep("Russia", names(df1))]
According to ?rowSums
x - an array of two or more dimensions, containing numeric, complex, integer or logical values, or a numeric data frame. For .colSums() etc, a numeric, integer or logical matrix (or vector of length m * n).
So it won't accept a plain vector.
In the lapply call over the list, removing the comma should make it work:
lapply(myfiles3, function(x) {
x$RUSSIA2 <- rowSums(x[grep("Russia", names(x))], na.rm = T)
x})
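An equivalent sketch that keeps the comma but passes drop = FALSE explicitly. The two small data frames are made up for illustration and stand in for the real `myfiles3`: one has a single "Russia" column, the other has two:

```r
# Toy stand-in for myfiles3 (hypothetical data)
myfiles3 <- list(
  data.frame(id = 1:3, Russia_a = c(1, NA, 3)),
  data.frame(id = 1:3, Russia_a = 1:3, Russia_b = 4:6)
)

myfiles3c <- lapply(myfiles3, function(x) {
  # drop = FALSE keeps a data frame even when only one column matches
  x$RUSSIA2 <- rowSums(x[, grep("Russia", names(x)), drop = FALSE], na.rm = TRUE)
  x
})
```

This presumably also explains why the China version worked: if every data frame has at least two "China" columns, the subset never collapses to a vector.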
Just learning dplyr (and R) and I do not understand why this fails or what the correct approach to this is. I am looking for a general explanation rather than something specific to this contrived dataset.
Assume I have 3 file sizes with multipliers and I'd like to combine them into a single numeric column.
require(dplyr)
m <- data.frame(
K = 1E3,
M = 1E6,
G = 1E9
)
s <- data.frame(
size = 1:3,
mult = c('K', 'M', 'G')
)
Now I want to multiply the size by its multiplier so I tried:
mutate(s, total = size * m[[mult]])
#Error in .subset2(x, i, exact = exact) :
# recursive indexing failed at level 2
which throws an error. I also tried:
mutate(s, total = size * as.numeric(m[mult]))
#1 1 K 1e+06
#2 2 M 2e+09
#3 3 G 3e+03
which is worse than an error (wrong answer)!
I tried a lot of other permutations but could not find the answer.
Thanks in advance!
Edit:
(or should this be another question)
akrun's answer worked great and I thought I understood but if I
rbind(s, c(4, NA))
then update the mutate to
mutate(s, total = size *
  ifelse(is.na(mult), 1,
    unlist(m[as.character(mult)])))
it falls apart again with an "undefined columns selected" error.
The 'mult' column is of class 'factor'. Convert it to 'character' for subsetting 'm', then 'unlist' the result and multiply it by 'size':
mutate(s, new= size*unlist(m[as.character(mult)]))
# size mult new
#1 1 K 1e+03
#2 2 M 2e+06
#3 3 G 3e+09
If we look at how the 'factor' columns act based on the 'levels'
m[s$mult]
# M G K
#1 1e+06 1e+09 1000
We get the same order of output by using match between the names(m) and levels(s$mult)
m[match(names(m), levels(s$mult))]
# M G K
#1 1e+06 1e+09 1000
So, this might be the reason why you got a different result
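To make the factor behaviour concrete, and to cover the NA row from the question's edit, here is a small base R sketch. `mult` is built with factor() explicitly, because data.frame() no longer converts strings to factors by default from R 4.0.0 onward. The names `lookup` and `mult_value` are illustrative only:

```r
m <- data.frame(K = 1e3, M = 1e6, G = 1e9)
s <- data.frame(size = 1:4, mult = factor(c("K", "M", "G", NA)))

# A factor used as an index is treated as its integer codes, which follow
# the sorted levels (G, K, M), not the column order of m (K, M, G)
as.integer(s$mult)

# Look up by name in a named vector instead; an NA key simply yields NA
lookup <- unlist(m)
mult_value <- unname(lookup[as.character(s$mult)])

# Fall back to a multiplier of 1 where no unit is given
s$total <- s$size * ifelse(is.na(mult_value), 1, mult_value)
```

Indexing a named vector with NA returns NA rather than raising "undefined columns selected", which is what happens when a data frame is indexed with a missing column name.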
If you don't mind changing the data structure of m, you could use
# change m to a table
m = as.data.frame(t(m))
m$mult = rownames(m)
colnames(m)[which(colnames(m) == "V1")] = "value"
# to avoid indexing
s %>%
inner_join(m) %>%
mutate(total = size*value) %>%
select(size, mult, total)
to keep things more dplyr based.
EDIT: Though it works, you may need to be a little careful about the data types of the columns being joined.
I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Replacement value for each row (NA where there is no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Use the replacement where one exists, otherwise keep y
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Note that this is not a very robust solution. It depends on your exact data structure here (the repeating 'c', 'd' pattern), but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
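To illustrate why the recycling approach depends on the row order, here is a small sketch with a made-up `df` in which the keys do not alternate. The match()-based lookup still assigns each row its own value, while recycling `eds$val` does not:

```r
# Hypothetical data where 'c' and 'd' do not alternate
df  <- data.frame(x = c("d", "a", "c", "c", "b", "d"), y = 1:6)
eds <- data.frame(x = c("c", "d"), val = c(101, 102))

# match() looks up each row's own key, so row order does not matter
tmp <- eds$val[match(df$x, eds$x)]
by_match <- ifelse(is.na(tmp), df$y, tmp)

# Recycling eds$val over the matched rows assumes they come as c, d, c, d, ...
by_recycle <- df$y
by_recycle[df$x %in% eds$x] <- eds$val
```

Here the first row has key 'd' but receives 101 under recycling, because the values are filled in positionally rather than by key.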