Whilst reviewing a colleague's Stata code I came across the command expand.
I would really love to be able to do the same thing simply in my own R code.
Essentially, expand duplicates the observations in a dataset n times, with the option to create a new variable that is 0 if the observation appeared in the original dataset and 1 if the observation is a duplicate.
Does anyone know of a quick way of implementing this in R? Or is it a case of writing my own function?
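For reference, a minimal sketch of the full behavior I'm after, including the 0/1 duplicate indicator (the function name expand_stata and the variable name duplicate are placeholders of mine, not Stata's):
# Sketch: Stata-style expand plus a duplicate flag.
# Assumes df is a data.frame and n gives one count per row;
# counts below 1 keep the original row once.
expand_stata <- function(df, n) {
  n <- replace(n, n < 1, 1)
  idx <- rep(seq_len(nrow(df)), n)
  out <- df[idx, , drop = FALSE]
  out$duplicate <- as.integer(duplicated(idx))  # 0 = original, 1 = copy
  rownames(out) <- NULL
  out
}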
# Repeat each element n times; counts below 1 keep the element once
rep_r <- function(x, n) {
  if (n <= 1) rep(x, times = 1) else rep(x, times = n)
}

expand_r <- function(x, n) {
  # mapply() returns a list of repeated elements; Reduce() flattens it
  Reduce(function(x, y) c(x, y), mapply(rep_r, x, n))
}
expand_r(c(2,3,4,1,5),c(-1,0,1,2,3))
#[1] 2 3 4 1 1 5 5 5
EDIT: Thanks to the suggestion from @nicola, the above functionality can be achieved with the following one-liner.
expand_r <- function(x, n) rep(x, replace(n, n < 1, 1))
#>expand_r(c(2,3,4,1,5),c(-1,0,1,2,3))
#[1] 2 3 4 1 1 5 5 5
This function expands the rows of a data.frame like the Stata expand command does. I got the idea from the R mefa package.
expand_r <- function(df, ...) {
  # rep() each column with the same arguments, then reassemble as a data.frame
  as.data.frame(lapply(df, rep, ...))
}
df <- data.frame(x = 1:2, y = c("a", "b"))
expand_r(df, times = 3)
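For the example above, rep(times = 3) recycles each column as a whole, so the result repeats the rows in sequence rather than copying each row in place (expected output):
#   x y
# 1 1 a
# 2 2 b
# 3 1 a
# 4 2 b
# 5 1 a
# 6 2 b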
I struggled to find a good title for this post, so apologies if it isn't the best description of what I am asking. Let's say I have a data frame with 50 columns, and I want to split it up in pairs of two columns, resulting in a list of 25 data frames.
col_one col_two col_three col_four col_five ... col_fifty
1 1 1 1 1 1
2 2 2 2 2 2
One way I've been able to solve this is with purrr::map() and the pmin() function, like so:
purrr::map(seq(1, ncol(data), by = 2), ~ data[.x:pmin((.x + 1), ncol(data))])
However, the more I thought about this, the more I thought that I could simply drop the pmin() function and do the same task like this:
purrr::map(seq(1, ncol(data), by = 2), ~ data[, .x:.x+1])
This does create a list of 25 data frames, but each one contains only one column rather than two.
Could anybody explain why? Thanks!
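For what it's worth, the behavior comes down to operator precedence: the colon operator binds tighter than +, so .x:.x+1 evaluates as (.x:.x) + 1, i.e. a single column index. A quick sketch:
1:1 + 1      # [1] 2       one index
1:(1 + 1)    # [1] 1 2     two indices, as intended
# so data[, .x:(.x + 1)] selects the intended pair of columns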
This question already has answers here:
Why is it not advisable to use attach() in R, and what should I use instead?
(4 answers)
Closed 6 years ago.
I have a data frame in R but want all of the variables in the data frame to be single variables in my workspace instead. So, I am looking for a command where I just use command(df) in the code below and end up with Var_A, Var_B, and Var_C in my workspace.
data <- 1:12
df <- data.frame(matrix(data, ncol = 3))
names(df) <- c("Var_A", "Var_B", "Var_C")
df
> df
Var_A Var_B Var_C
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
EDIT: My question is not an exact duplicate of the suggested question. The suggested question asks why it is not good to do what I want; there is a difference between asking how to do something and why doing it might be bad. Moreover, I don't understand the downvotes. I stated a clear question with a reproducible code sample. Instead of voting me down because one thinks what I want to do is a bad idea, one could simply answer and propose an alternative.
Hi, if you want to do that you can use:
list2env(x = df, envir = .GlobalEnv)
Var_A
# [1] 1 2 3 4
ls()
# [1] "data" "df" "Var_A" "Var_B" "Var_C"
EDIT: you should probably keep your variables in your data.frame, or at least in a list; there is probably another way to achieve what you want while keeping the data.frame.
Not sure why you would want to do that, but here goes:
decomposeFrame <- function(df) {
  vars <- colnames(df)
  # build and evaluate e.g. "Var_A <<- df$Var_A" for each column
  sapply(vars, function(x) {
    eval(parse(text = paste0(x, " <<- df$", x)))
  })
}
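For comparison, a sketch of a variant that avoids parse()/eval() by using assign() (my own alternative, not the answer above; same effect):
decomposeFrame2 <- function(df) {
  # copy each column into the global environment under its own name
  for (nm in names(df)) assign(nm, df[[nm]], envir = globalenv())
  invisible(names(df))
}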
My problem is that I can't really get my problem down in words, which makes it hard to google, so I am forced to ask you. I hope you can shed some light on my issue:
I got a data.frame like this (two columns, which I'll call V1 and V2):
6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1
As you may have noticed, in the first column I got 0 repeating two times, 1 two times, and so on. What I would like to do is get all the corresponding values in the second column for a given number, say 0 (in this example, 7 and 2), preferably as a data.frame.
I know the approach with df$V2[which(df$V1==0)]; however, since the first column might have over 100 rows, I can't really use this. Do you guys have a good solution?
Maybe some words regarding the background of this question: I need to process this data, i.e. get the mean of the second column for all 0's in the first column, or get min/max values.
Regards
Here is a solution using dplyr:
library(dplyr)
df %>% group_by(V1) %>% summarize(ME = mean(V2))
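For the example data this should return one row per distinct V1 value (output formatting approximate; group_by sorts the groups):
# V1  ME
#  0  4.5
#  1  2.0
#  3  6.0
#  5  2.0
#  6  2.0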
Using your data (with some temporary names attached)
txt <- "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1"
df <- read.table(text = txt)
names(df) <- paste0("Var", seq_len(ncol(df)))
Coerce the first column to be a factor
df <- transform(df, Var1 = factor(Var1))
Then you can use aggregate() with a nice formula interface
aggregate(Var2 ~ Var1, data = df, mean)
aggregate(Var2 ~ Var1, data = df, max)
aggregate(Var2 ~ Var1, data = df, min)
For example:
> aggregate(Var2 ~ Var1, data = df, mean)
Var1 Var2
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
Or, using the default interface:
with(df, aggregate(Var2, list(Var1), FUN = mean))
> with(df, aggregate(Var2, list(Var1), FUN = mean))
Group.1 x
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
But the output is nicer from the formula interface.
Using data.table
library(data.table)
setDT(df)[, list(mean=mean(V2), max= max(V2), min=min(V2)), by = V1]
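For the example data this should give something like the following (by = keeps groups in order of first appearance; values computed from the data above):
#    V1 mean max min
# 1:  6  2.0   4   0
# 2:  5  2.0   2   2
# 3:  3  6.0   6   6
# 4:  0  4.5   7   2
# 5:  1  2.0   3   1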
First, what exactly is the issue with the solution you suggest? Is it a question of efficiency? Frankly, the code you present is close to optimal [1].
For the general case, you're probably looking at a split-apply-combine action, to apply a function to subsets of the data based on some differentiator. As @teucer points out, dplyr (and its ancestor, plyr) is designed for exactly this, as is data.table. In vanilla R, you would tend to use by or aggregate (or split and sapply for more advanced usage) for the same task. For example, to compute group means, you would do
by(df$V2, df$V1, mean)
or
aggregate(df, list(type=df$V1), mean)
Or even
sapply(split(df$V2, df$V1), mean)
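For the example data, the split/sapply version returns a named vector (sketch of the expected output; split sorts by the factor levels):
#   0   1   3   5   6
# 4.5 2.0 6.0 2.0 2.0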
[1] The code can be simplified to df$V2[df$V1 == 0] or df[df$V1 == 0,] as well.
Thanks all for your replies. I decided to go for the dplyr solution posted by teucer and eipi10. Since I have a third (and maybe even a fourth) column, this solution seems to be pretty easy to use (just adding V3 to group_by).
Since some are asking what's wrong with df$V2[which(df$V1==0)]: I maybe was a bit unclear when saying "rows"; what I actually meant was "values". Assuming I had n distinct values in the first column, I would have to use the command n times, once for each distinct value, and store the n resulting vectors.
I'm quite new to R and I would like to learn how to write a loop to create and process several columns.
I imported a table into R that contains data with 23 variables. For all of these variables I want to calculate the per capita value, multiply it by 1000, and either write the data into a new table or into the same table as the old data.
So, to do this for only one column, my operation looked like this:
agriculture <- cbind(agriculture, "Total_value_per_capita" = agriculture$Total / agriculture$Total.Population * 1000)
Now I'm asking how to do this in a loop for the 23 variables so that I won't have to write 23 similar lines of code.
I think the solution might look quite similar to the code pasted in this thread:
loop to create several matrix in R (maybe using paste)
but I didn't get it working with my code.
So any suggestion would be very helpful.
I would always favor an appropriate *ply function over loops in R. In this case sapply could be your friend:
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
df.per.capita <- as.data.frame(
sapply(
df[ colnames(df) != "c" ], function(x){ x/df$c *1000 }
)
)
For more complicated cases, you should definitely have a look at the plyr package.
This can be done using the sweep() function. Using Beasterfield's data generation but setting the seed, you can obtain the same results:
set.seed(001)
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
per.capita <- sweep(df[,colnames(df) != "c"], 1, STATS=df$c, FUN='/')*1000
per.capita
a b
1 300.0000 300.0000
2 2000.0000 1000.0000
3 833.3333 1000.0000
4 7000.0000 10000.0000
5 222.2222 555.5556
6 1000.0000 875.0000
7 1285.7143 1142.8571
8 1200.0000 800.0000
9 3333.3333 333.3333
10 250.0000 2250.0000
Comparing with Beasterfield's results:
all.equal(df.per.capita, per.capita)
[1] TRUE
Could someone please point me to how we can apply multiple functions to the same column using tapply (or any other method: plyr, etc.), so that the result is obtained in distinct columns? E.g., if I have a dataframe with:
User MoneySpent
Joe 20
Ron 10
Joe 30
...
I want to get the result as both the sum of MoneySpent and the number of occurrences.
I used a function like:
f <- function(x) c(sum(x), length(x))
tapply(df$MoneySpent, df$User, f)
But this does not split the result into columns; it gives something like, say,
Joe Joe 100, 5 # the sum = 100 and the number of occurrences = 5, but they get juxtaposed
Thanks in advance,
Raj
You can certainly do stuff like this using ddply from the plyr package:
library(plyr)
dat <- data.frame(x = rep(letters[1:3], 3), y = 1:9)
ddply(dat, .(x), summarise, total = NROW(piece), count = sum(y))
x total count
1 a 3 12
2 b 3 15
3 c 3 18
You can keep listing more summary functions, beyond just two, if you like. Note I'm being a little tricky here in calling NROW on an internal variable in ddply called piece. You could have just done something like length(y) instead. (And probably should; referencing the internal variable piece isn't guaranteed to work in future versions, I think. Do as I say, not as I do and just use length().)
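For instance, the length(y) version mentioned above would look like this (a sketch; the output should match the table above):
ddply(dat, .(x), summarise, total = length(y), count = sum(y))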
ddply() is conceptually the clearest, but sometimes it is useful to use tapply instead for speed reasons, in which case the following works:
do.call( rbind, tapply(df$MoneySpent, df$User, f) )
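With the example data from the question (assuming just the three rows shown), the result should be a matrix with one row per user, roughly:
#     [,1] [,2]
# Joe   50    2
# Ron   10    1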