I don't know how or where to start, but I hope someone can help. It's the first time I've used R like this, so even a keyword or a recommendation of where to look it up would be helpful.
My dataframe looks like this:
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
And I would like to get a cross table of X against Y, showing the count for every combination together with row and column totals. Using
length(which(df$X == & df$Y == ))
I could calculate the data cell by cell with R and fill it into my Excel sheet, but there has to be a better option.
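To make that concrete, one cell counted by hand looks like this (the values 1 and 1 are chosen just for illustration), and it would have to be repeated for every combination of X and Y:
# one cell of the cross table, counted by hand
length(which(df$X == 1 & df$Y == 1))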
Thank you in advance.
Try this base R solution:
#Data
set.seed(1)
df <- data.frame(
X = sample(c(1, 2, 3), 50, replace = TRUE),
Y = sample(c(1, 2, 3), 50, replace = TRUE))
#Code
addmargins(table(df$X,df$Y))
Output:
1 2 3 Sum
1 6 7 5 18
2 4 6 9 19
3 5 5 3 13
Sum 15 18 17 50
You can also change the order of your variables like this:
#Code2
addmargins(table(df$Y,df$X))
Output:
1 2 3 Sum
1 6 4 5 15
2 7 6 5 18
3 5 9 3 17
Sum 18 19 13 50
To export the table to MS Excel, you can use this code:
library(xlsx)
#Transform to dataframe
d1 <- as.data.frame.matrix(addmargins(table(df$X,df$Y)))
#Export
write.xlsx(d1, file = 'myexample.xlsx', sheetName = 'Sheet1')
If the data have only two columns, just pass the data.frame object to table.
addmargins(table(df))
If the data include more than two columns, you can subset its variables before passing them to table().
addmargins(table(df[c("X", "Y")]))
You can also pass a formula to xtabs().
addmargins(xtabs( ~ X + Y, df))
All of the above give
Y
X 1 2 3 Sum
1 5 6 3 14
2 2 6 6 14
3 13 4 5 22
Sum 20 16 14 50
To export the table to an excel file, you can use write.xlsx() from openxlsx.
library(openxlsx)
tab <- addmargins(xtabs( ~ X + Y, df))
write.xlsx(tab, "foo.xlsx")
I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain, for each particle, a list of the energy it had every time it changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
tempstate <- state[1]
energyvec <- energy[1]
for(i in 2:length(state)){
if(state[i] != tempstate){
energyvec <- c(energyvec, energy[i])
tempstate <- state[i]
}
}
return(energyvec)
}
And then try to pass it to aggregate() somehow.
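For reference, the function does return the kind of vector I want when called directly on one particle's rows (using the energy values from the printout above, which are not reproducible since no seed was set):
# quick check on particle X only
with(df[df$particle == "X", ], myfun(state, energy))
# [1] 9 7 5 0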
The two data structures I tried for this are data.frame and data.table.
In a data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, i.e. the output column is really a list and each row contains the function's output for that group. However, I can't seem to pass several columns to the function when aggregating this way.
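For example, an attempt along these lines (the exact call here is just a sketch) builds the right shape but only ever hands FUN one column:
# sketch: with unequal group sizes the energy column effectively becomes a list ...
aggregate(energy ~ particle, df, FUN = c)
# ... but FUN receives a single column at a time, so state and energy cannot
# both be passed to myfun this way.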
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[, myfun(state, energy), by = particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[, as.list(myfun(state, energy)), by = particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
group_by(particle) %>%
mutate(
changed_state = coalesce(state != lag(state, 1), TRUE)
) %>%
filter(changed_state) %>%
summarise(
string = toString(energy)
)
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually. Basically, create a changed_state variable by checking whether the current state differs from the previous one, lag(state, 1). Since we only care about the rows where that happens, we filter where it is TRUE (a more verbose version would be filter(changed_state == TRUE)). The toString() function then collapses the rows of energy as desired, and we are already "grouped" by particle.
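If you'd rather end up with a real list-column of numeric vectors than a collapsed string (closer to the output shown in the question), one possible variant of the summarise() step (a sketch using the same grouping and filtering as above):
df %>%
  group_by(particle) %>%
  mutate(changed_state = coalesce(state != lag(state, 1), TRUE)) %>%
  filter(changed_state) %>%
  summarise(energy = list(energy))   # energy becomes a list-column, one vector per particle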
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use the (unexported) data.table:::uniqlist to get the row numbers where the
# combination of `particle` and `state` changes, then keep only those rows
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
l
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
# for each particle, take the first energy of every run of identical states
# (rleid numbers the runs), then wrap the resulting vector in a list-column
setDT(DF)[, .(energy = .(.SD[, first(energy), by = .(rleid(state))]$V1)), by = .(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
If I have 5 data frames in the global environment, such as a,b,c,d,and e
I want data frame a to be compared with e, and if R finds any common elements in a and e, those elements should be deleted from a. Then I want data frame b to be compared with e and its common elements deleted, and so on.
Actually I have 20 tables that need to be compared with e.
Can anyone suggest an elegant way to handle this problem? I'm thinking of a loop or a function, but I can't work out the details.
Thanks everybody and have a nice day!
The easiest would be to put all the dataframes you want to compare in a list, then use lapply to loop over this list:
# create list of data.frames
dlist <- list(df1 = data.frame(var1 = 1:10), df2 = data.frame(var1 = 11:20),
df3 = data.frame(var1 = 21:30), df4 = data.frame(var1 = 31:40))
# create master-data.frame
set.seed(1)
df <- data.frame(var1 = sample(1:100, 30))
# use lapply() to loop over the data and exclude all elements that are in the master-data.frame
dlist <- lapply(dlist, function(x){
x <- x[!x$var1 %in% df$var1, , drop = FALSE]
})
Result:
> dlist
$df1
var1
2 2
3 3
4 4
5 5
7 7
8 8
9 9
$df2
var1
1 11
2 12
3 13
4 14
5 15
8 18
$df3
var1
2 22
3 23
4 24
6 26
10 30
$df4
var1
1 31
3 33
5 35
6 36
8 38
9 39
10 40
If you absolutely need the dataframes in your global environment, you could use list2env:
list2env(dlist, envir = .GlobalEnv)
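If the data frames already exist as separate objects in the global environment (a, b, c, d and the master e, as in the question), one way to collect them first is a sketch like this; it assumes they all share the column being compared on, here called var1:
# gather the existing data frames into a named list (names taken from the question)
dlist <- mget(c("a", "b", "c", "d"))
# drop rows whose var1 also occurs in e, then write the filtered versions back
dlist <- lapply(dlist, function(x) x[!x$var1 %in% e$var1, , drop = FALSE])
list2env(dlist, envir = .GlobalEnv)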
I have just started learning R using RStudio and I have, perhaps, some basic questions.
One of them regards the "sample" function.
More specifically, my dataset consists of 402224 observations of 147 variables. My task is to take a sample of 50 observations and then produce a dataframe and so on.
But when the function sample is executed
y = sample(mydata, 50, replace = TRUE, prob = NULL)
the result is a dataset with 402224 observations of 50 variables. That is, the sampling is done on the variables (columns) and not the observations (rows).
Do you have any idea why this happens?
Thank you in advance.
If you want to create a data frame of 50 observations with replacement from your data frame, you can try:
mydata[sample(nrow(mydata), 50, replace=TRUE), ]
Alternatively, you can use the sample_n function from the dplyr package:
sample_n(mydata, 50, replace = TRUE)
The other answers people have been giving are to select rows, but it looks like you are after columns. You can still accomplish this in a similar way.
Here's a sample df.
df = data.frame(a = 1:5, b = 6:10, c = 11:15)
> df
a b c
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
Then, to randomly select 2 columns and all observations we could do this
> df[ , sample(1:ncol(df), 2)]
c a
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5
So, what you'll want to do is something like this
y = mydata[ , sample(1:ncol(mydata), 50)]
That is because sample() treats a data frame as a list of columns, so it samples columns rather than rows.
Try the following:
library(data.table)
set.seed(10)
df_sample <- data.table(mydata)
df_sample[sample(.N, 50, replace = TRUE)]  # draw 50 rows (with replacement, as in the question)
Sample data:
a <- sample(1:4, 100, replace = T)
b <- sample(0:1, 100, replace = T)
d <- data.frame(a, b)
I want to achieve this output automatically for all levels of a:
table(d$b[d$a==1])
table(d$b[d$a==2])
table(d$b[d$a==3])
table(d$b[d$a==4])
I could do a for-loop, but that is not in the spirit of R.
for (i in unique(d$a)) {
print(table(d$b[d$a==i]))
}
Rather, I want to use one of the many list-function in R.
I tried to use ddply from the plyr package:
ddply(d, ~a, function(x) table(b))
But that is just the same as table(d$b) repeated four times.
How do I apply the table() function to each group in a using a list-function, preferably ddply?
You can use table with multiple arguments:
table(d$a,d$b)
0 1
1 15 10
2 6 16
3 13 10
4 20 10
Or, if the data.frame contains only the variables you want to tabulate, table() will handle it for you if you pass in the whole data.frame:
table(d)
b
a 0 1
1 15 10
2 6 16
3 13 10
4 20 10
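If you specifically want a separate table of b for each level of a (one list element per group, like your loop produces), a base-R sketch:
# split b by the levels of a, then tabulate each piece; returns a named list of tables
lapply(split(d$b, d$a), table)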
I am trying to calculate the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes values from the previous year. A sample data set can be created with this code:
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
(df <- rbind(x, y, z))
I can use a combination of lapply() and split() to calculate the difference between each year for every unique id, like so:
(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))
However, because of the nature of the diff() function, there are no results for the values in year 1, which means that after I flatten the diffs list of lists with Reduce(), I cannot add the actual yearly increases back into the data frame, like so:
df$actual <- Reduce(c, diffs) # flatten the list of lists
In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
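To see the length mismatch concretely:
length(Reduce(c, diffs))  # 10 differences, two per id
nrow(df)                  # 15 rows, so the assignment above fails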
How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?
This is the output I'm eventually looking for. My diffs list of lists calculates the actual values for years 2 and 3 just fine.
id value year actual
1 21 3 5
2 26 3 16
3 26 3 14
4 26 3 10
5 29 3 14
1 16 2 10
2 10 2 5
3 12 2 10
4 16 2 7
5 15 2 13
1 6 1 6
2 5 1 5
3 2 1 2
4 9 1 9
5 2 1 2
I think this will work for you. When you run into the diff problem just lengthen the vector by putting 0 in as the first number.
df <- df[order(df$id, df$year), ]
sdf <-split(df, df$id)
df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]
There are lots of ways to do this, but this one is fairly fast and uses only base R.
Here's a second & third way of approaching this problem utilizing aggregate and by:
aggregate:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]
by:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]
plyr
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(-df$year, df$temp),][, -1]
It gives you the final product of:
> df[order(as.numeric(rownames(df))),]
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
EDIT: Avoiding the Loop
May I suggest avoiding the loop: turn what I gave you into a function (the by solution is the easiest one for me to work with) and sapply it over the two columns you desire.
set.seed(1234) #make new data with another numeric column
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.rep <- data.frame(df[, 1:2], new.var=df[, 2]+sample(1:5, nrow(df),
replace=T), year=df[, 3])
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- data.frame(df, sapply(df[, 2:3], group.diff)) #apply group.diff to col 2:3
df[order(as.numeric(rownames(df))),] #reorder it
Of course you'd have to rename these unless you used transform as in:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]
This would depend on how many variables you were doing this to.
1) diff.zoo. With the zoo package it's just a matter of converting the data to zoo using split= and then performing the diff:
library(zoo)
zz <- zz0 <- read.zoo(df, split = "id", index = "year", FUN = identity)
zz[2:3, ] <- diff(zz)
It gives the following (in wide form rather than the long form you mentioned) where each column is an id and each row is a year minus the prior year:
> zz
1 2 3 4 5
1 6 5 2 9 2
2 10 5 10 7 13
3 5 16 14 10 14
The wide form shown may actually be preferable but you can convert it to long form if you want that like this:
dt <- function(x) as.data.frame.table(t(x))
setNames(cbind(dt(zz), dt(zz0)[3]), c("id", "year", "value", "actual"))
This puts the years in ascending order which is the convention normally used in R.
2) rollapply. Also using zoo this alternative uses a rolling calculation to add the actual column to your data. It assumes the data is structured as you show with the same number of years in each group arranged in order:
df$actual <- rollapply(df$value, 6, partial = TRUE, align = "left",
FUN = function(x) if (length(x) < 6) x[1] else x[1]-x[6])
3) subtraction. Making the same assumptions as in the prior solution we can further simplify it to just this which subtracts from each value the value 5 positions hence:
transform(df, actual = value - c(tail(value, -5), rep(0, 5)))
or this variation:
transform(df, actual = replace(value, year > 1, -diff(ts(value), 5)))
EDIT: added rollapply and subtraction solutions.
Kind of hackish, but keeping your wonderful Reduce in place, you could add mock rows to your df for year 0:
mockRows <- data.frame(id = 1:5, value = 0, year = 0)
(df <- rbind(df, mockRows))
(df <- df[order(df$id, df$year), ])
(diffs <- lapply(split(df, df$id), function(x){diff(x$value)}))
(df <- df[df$year != 0,])
(df$actual <- Reduce(c, diffs)) # flatten the list of lists
df[order(as.numeric(rownames(df))),]
This is the output:
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2