transform data frame to another form - r

I'm new to R, and I'm stuck on how to perform an operation on a data.frame.
I have a data.frame called frame, and I want to transform it into another form.
> frame
A B Freq total
1 0 0 75 110
2 1 0 21 110
3 0 1 8 110
4 1 1 6 110
The expected form is:
> frame(B=1)
A Freq total
1 0 8 83
2 1 6 27
Can anyone give some suggestions? Thanks

One option is dplyr. We group by 'A', overwrite the 'total' column with the sum of 'Freq' within each group, filter the rows where B == 1, and select all columns except 'B'.
library(dplyr)
frame %>%
group_by(A) %>%
mutate(total= sum(Freq)) %>%
filter(B==1)%>%
select(-B)
# A Freq total
#1 0 8 83
#2 1 6 27
Or using data.table, we convert the data.frame to data.table (setDT(frame) or we can do as.data.table(frame)), create a new column total as the sum of 'Freq', grouped by 'A', subset the rows with B=1, and remove the 'B' column by assigning it to NULL.
library(data.table)
setDT(frame)[, total:= sum(Freq), A][B==1][,B:=NULL]
# A Freq total
#1: 0 8 83
#2: 1 6 27
Or using base R, we create the 'total' using transform/ave and then subset the rows that are 1 for 'B'.
subset(transform(frame, total=ave(Freq, A, FUN=sum)), B==1, select=-B)
# A Freq total
#3 0 8 83
#4 1 6 27

Below is an example using the base R functions aggregate() and merge().
frame <- read.table(header = TRUE, text = "
A B Freq total
1 0 0 75 110
2 1 0 21 110
3 0 1 8 110
4 1 1 6 110")
# obtain sum by column A
frame1 <- aggregate(frame$Freq, by = list(frame$A), sum)
names(frame1) <- c("A", "total")
# merge Freq
frame2 <- merge(frame1, frame[frame$B == 1, c(1,3)], by="A")
# A total Freq
#1 0 83 8
#2 1 27 6

Related

Drawing multiple barplots on a graph using data with different size

I can plot multiple bar plots on one plot with following code (taken from this question):
mydata <- data.frame(Barplot1=rbinom(5,16,0.6), Barplot2=rbinom(5,16,0.25),
Barplot3=rbinom(5,5,0.25), Barplot4=rbinom(5,16,0.7))
barplot(as.matrix(mydata), main="Interesting", ylab="Total", beside=TRUE,
col=terrain.colors(5))
legend(13, 12, c("Label1","Label2","Label3","Label4","Label5"), cex=0.6,
fill=terrain.colors(5))
But my scenario is a bit different: I have data stored in 3 data.frames (sorted according to V2 column) where V1 column is the Y axis and V2 column is the X axis:
> tail(hist1)
V1 V2
67 2 70
68 2 72
69 1 73
70 2 74
71 1 76
72 1 84
> tail(hist2)
V1 V2
87 1 92
88 3 94
89 1 95
90 2 96
91 1 104
92 1 112
> tail(hist3)
V1 V2
103 3 110
104 1 111
105 2 112
106 2 118
107 2 120
108 1 138
Plotting a single plot is as simple as:
barplot(hist3$V1, main="plot title", names.arg = hist3$V2)
But I cannot construct the matrix needed for the combined plot because of several problems that I can see right now (there may be others):
My data frames have different sizes:
> nrow(hist1)
[1] 72
> nrow(hist2)
[1] 92
> nrow(hist3)
[1] 108
There are X values (and therefore Y values) which are in one data.frame but not in another, e.g.:
> hist3$V2[which(hist3$V2==138)]
[1] 138
> hist1$V2[which(hist1$V2==138)]
integer(0)
What I need (I guess) is something that creates the missing V2 (x-axis) values with a Y value of 0 in each data.frame, so that they all have the same length and I can combine them as in the example above. See the following example with only 2 data.frames (the v2 and v1 columns are swapped relative to the previous example):
> # missing v2 for 3,4,5
> df1
v2 v1
1 1 1
2 2 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
> # missing v2 for 1,2,9,10
> df2
v2 v1
1 3 1
2 4 2
3 5 3
4 6 4
5 7 5
6 8 6
> # some_magic_goes_here ...
> df1
v2 v1
1 1 1
2 2 2
3 3 0 # created
4 4 0 # created
5 5 0 # created
6 6 3
7 7 4
8 8 5
9 9 6
10 10 7
> df2
v2 v1
1 1 0 # created
2 2 0 # created
3 3 1
4 4 2
5 5 3
6 6 4
7 7 5
8 8 6
9 9 0 # created
10 10 0 # created
Thanks
You can probably do this by 1) retrieving all possible x-axis values (v2 values) from all data.frames, and 2) using this information to retrieve the existing values and fill the missing ones with zeroes.
set.seed(111)
df1 <- data.frame(v2= sample(1:10, size = 7),
v1 = sample(1:100, size = 7))
df2 <- data.frame(v2= sample(1:10, size = 7),
v1 = sample(1:100, size = 7))
df3 <- data.frame(v2= sample(1:10, size = 7),
v1 = sample(1:100, size = 7))
First, retrieve your categories / x-axis values / v2
Note that if class(df1$v2) == "factor", then you should use levels() instead of unique()
my.x <- unique(c(df1$v2, df2$v2, df3$v2))
Likely, you want it sorted
my.x <- sort(my.x)
Now, use my.x to re-order/fill your data.frames, starting with df1. Specifically, you check each value of my.x: if that value is included in df1$v2, then the corresponding v1 is returned, otherwise 0.
my.df1 <- data.frame(v2 = my.x,
v1 = sapply(my.x, (function(i){
ifelse (i %in% df1$v2, df1$v1[df1$v2 == i], 0)
})))
my.df1
A simple way to apply this operation to all your data.frames is to list them together and then use lapply()
dfs <- list(df1 = df1, df2 = df2, df3 = df3)
dfs <- lapply(dfs, (function(df){
data.frame(v2 = my.x,
v1 = sapply(my.x, (function(i){
ifelse (i %in% df$v2, df$v1[df$v2 == i], 0)
})))
}))
# show all data.frames
dfs
# show df1
dfs$df1
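As a variation on the steps above, the same fill can be done with match() instead of an sapply() loop, and the aligned values can then be stacked into the matrix that barplot() expects. This is only a sketch using the two small data.frames from the question; the helper name fill_missing and the object names all_x, filled and counts are made up for illustration.

```r
# Example data from the question
df1 <- data.frame(v2 = c(1, 2, 6, 7, 8, 9, 10), v1 = 1:7)
df2 <- data.frame(v2 = 3:8, v1 = 1:6)

# Hypothetical helper: align one data.frame to a common set of x values
fill_missing <- function(df, x) {
  v1 <- df$v1[match(x, df$v2)]  # NA where x does not occur in df$v2
  v1[is.na(v1)] <- 0            # missing x values get a count of 0
  data.frame(v2 = x, v1 = v1)
}

all_x <- sort(unique(c(df1$v2, df2$v2)))
filled <- lapply(list(df1 = df1, df2 = df2), fill_missing, x = all_x)

# One row per data.frame, one column per x value
counts <- do.call(rbind, lapply(filled, function(d) d$v1))
colnames(counts) <- all_x
counts

# Grouped bars, one group per x value (uncomment to draw):
# barplot(counts, beside = TRUE, legend.text = rownames(counts))
```

With beside = TRUE, barplot() draws one group of bars per matrix column, so each x value shows the bars of all data.frames side by side.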

Subset specific row and last row from data frame

I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
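The two conditions can also be combined into a single logical index, which avoids the rbind() and duplicated() steps and keeps the rows in their original order. The sketch below rebuilds the example data; the names last_rows, is_last, is_extreme and res are illustrative. Like the rle() approach above, it assumes the rows of each ID are contiguous.

```r
Data <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

# rle() gives the run lengths 6, 7, 3; their cumulative sum 6, 13, 16
# is the row number of the last row of each ID
last_rows <- cumsum(rle(Data$ID)$lengths)

is_last    <- seq_len(nrow(Data)) %in% last_rows
is_extreme <- abs(Data$Score) > 5

# A row is kept if either condition holds, so no row can appear twice
res <- Data[is_extreme | is_last, ]
res
```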
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

How do I delete rows with NAs and those that follow the NAs?

I have some data where I want to remove the NAs and the data that follows the NAs by the level of a factor.
Removing the NAs is easy:
df <- data.frame(a=c("A","A","A","B","B","B","C","C","C","D","D","D"), b=c(0,1,0,0,0,0,0,1,0,0,0,1) ,c=c(4,5,3,2,1,5,NA,5,1,6,NA,2))
df
newdf<-df[complete.cases(df),];newdf
The final result should remove all of the rows for C and the final two rows of D.
Hope you can help.
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'a', take the cumulative sum of the logical vector that marks the NA elements in 'c', and keep the rows where that sum is less than 1.
library(data.table)
setDT(df)[, .SD[cumsum(is.na(c))<1], by= a]
Or a faster option with .I to return the row index of the logical vector and subset the rows.
setDT(df)[df[, .I[cumsum(is.na(c)) < 1], by = a]$V1]
# a b c
#1: A 0 4
#2: A 1 5
#3: A 0 3
#4: B 0 2
#5: B 0 1
#6: B 0 5
#7: D 0 6
A classic split-apply-combine in base R:
do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
Here it is again, but in several lines:
split_df <- split(df, df$a)
apply_df <- lapply(split_df, function(x)x[cumsum(is.na(x$c))<1,])
combine_df <- do.call(rbind, apply_df)
The result:
> do.call(rbind,lapply(split(df, df$a),function(x)x[cumsum(is.na(x$c))<1,]))
# a b c
#A.1 A 0 4
#A.2 A 1 5
#A.3 A 0 3
#B.4 B 0 2
#B.5 B 0 1
#B.6 B 0 5
#D D 0 6
A similar solution in dplyr would be
library(dplyr)
df %>% group_by(a) %>% filter(!is.na(cumsum(c)))
Output:
Source: local data frame [7 x 3]
Groups: a [3]
a b c
<fctr> <dbl> <dbl>
1 A 0 4
2 A 1 5
3 A 0 3
4 B 0 2
5 B 0 1
6 B 0 5
7 D 0 6
If we take the cumulative sum of column 'c', every value from the first NA onward becomes NA. Doing this within each group lets us remove the NA rows and the rows that follow them, which gives the desired output.
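The NA propagation of cumsum() and the "count of NAs seen so far" idea can both be checked in base R without extra packages. This is a sketch on the question's data; na_seen and kept are illustrative names.

```r
df <- data.frame(a = rep(c("A", "B", "C", "D"), each = 3),
                 b = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1),
                 c = c(4, 5, 3, 2, 1, 5, NA, 5, 1, 6, NA, 2))

# cumsum() propagates NA: everything from the first NA onward is NA
cumsum(c(6, NA, 2))
# 6 NA NA

# Per-group running count of NAs; a row is kept while the count is still 0
na_seen <- ave(as.integer(is.na(df$c)), df$a, FUN = cumsum)
kept <- df[na_seen == 0, ]
kept
```

Group C starts with an NA, so all of its rows are dropped; group D keeps only its first row.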

Replacing the last value within groups with different values

My question is similar to this post, but the difference is instead of replacing the last value within each group/id with all 0's, different values are used to replace the last value within each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
group_by(id) %>%
mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last value within each group/id was 9999, 500, and 1. Now I would like: 9999 is replaced by 5, 500 is replaced by 12, and 1 is replaced by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this one:
df %>%
group_by(id) %>%
mutate(Time = replace(Time, n(), c(5,12,92))),
but it did not work.
This can be solved with almost the same solution I posted in the linked question; just replace 0L with the desired values:
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So to add some explanations:
.I is identical to row_number() or 1:n() in dplyr for ungrouped data, i.e. 1:nrow(df) in base R.
.N is like n() in dplyr, i.e. the size of a given group (or of the whole data set). So when I run .I[.N] by group, I'm retrieving the global index of the last row of each group.
The next step is just to use this index as a row index within df while assigning the desired values to Time by reference using the := operator.
Edit
Per the OP's request, here's a possible dplyr solution. Your original solution doesn't work because you are working per group, and thus you were trying to pass all three values to each group.
The only way I can think of is to first calculate the group sizes, then ungroup, and then mutate on the cumulative sum of these sizes, something along these lines:
library(dplyr)
df %>%
group_by(id) %>%
mutate(indx = n()) %>%
ungroup() %>%
mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
select(-indx)
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
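One caveat with cumsum(unique(indx)): it relies on every group having a different size. Here the sizes are 5, 3 and 2, giving positions 5, 8 and 10, but two groups of equal size would collapse into one entry in unique() and shift the positions. A base R sketch that does not depend on the group sizes is shown below; the lookup vector new_last and the name is_last are hypothetical, introduced only for this example.

```r
df <- data.frame(id   = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
                 Time = c(3, 10, 1, 0, 9999, 0, 9, 500, 0, 1))

# Hypothetical lookup: one replacement value per id, named by id
new_last <- c("1" = 5, "2" = 12, "3" = 92)

# TRUE for the last row of each id, whatever the group sizes are
is_last <- !duplicated(df$id, fromLast = TRUE)

# Look up the replacement by id rather than by position
df$Time[is_last] <- new_last[as.character(df$id[is_last])]
df
```

Looking the values up by id also means the replacement still lands on the right group if the ids are not sorted.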
Another way using data.table would be to create another data.table which contains the values to be replaced with for a given id, and then join and update by reference (simultaneously).
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace = data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
In data.table, joins are considered as an extension of subsets. It's natural to think of doing whatever operation we do on subsets also on joins. Both operations do something on some rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
Installation instructions for v1.9.5 here. Hope this helps.

How to use apply function once for each unique factor value

I'm trying out some commands on ChickWeight, a dataset built into R. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is simply output, for each value in the "Chick" column, the difference between the chicken weight at time 0 and at time 21 (the last time value), i.e. the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?
If we need a single value per each 'factor' column (assuming that 'Chick', and 'Diet' are the factor columns)
library(data.table)
setDT(df1)[, list(Diff= abs(weight[Time==21]-weight[Time==0])) ,.(Chick, Diet)]
And if we need to create a column:
setDT(df1)[, Diff:= abs(weight[Time==21]-weight[Time==0]) ,.(Chick, Diet)]
I noticed that in the example Time = 21 is not found for Chick 2; in that case, we may need to return just the one value that is present:
setDT(df1)[, {tmp <- Time %in% c(0,21)
list(Diff= if(sum(tmp)>1) abs(diff(weight[tmp])) else weight[tmp]) } ,
by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group
setDT(df1)[, list(Diff=weight[which.max(Time)]-
weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered
setDT(df1)[, list(Diff= abs(diff(weight[c(1L,.N)]))), by =.(Chick, Diet)]
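Using by from base R (this assumes df1 is a plain data.frame; after setDT(df1), df1[1:2] selects rows rather than columns, so call setDF(df1) first):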
by(df1[1:2], df1[3:4], FUN= function(x) with(x,
abs(weight[which.max(Time)]-weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
Here's a solution using dplyr:
library(dplyr)
ChickWeight %>%
group_by(Chick = as.numeric(as.character(Chick))) %>%
summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first() and last() as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor so without coercing it into numeric the final order looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
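As far as I can tell, the tapply() call from the question already returns one value per chick rather than one per row. The sketch below checks this on the built-in data without modifying ChickWeight; gain and gain_df are illustrative names, and it assumes the rows of each chick are stored in increasing Time order, as in the excerpt above.

```r
# ChickWeight ships with R (datasets package)
gain <- tapply(ChickWeight$weight, ChickWeight$Chick,
               function(w) w[length(w)] - w[1])

length(gain)   # one entry per chick, not one per row
gain["1"]      # chick 1: 205 - 42

# The same result as a two-column data.frame
gain_df <- data.frame(Chick = names(gain), gain = as.vector(gain))
head(gain_df)
```

The entries follow the level order of the ordered factor Chick, which is why indexing by name (gain["1"]) is safer than indexing by position.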
