I have some data in multiple large data tables in R. I wish to merge and produce counts of various variables.
I can produce the counts within individual tables easily using the 'table' command, but I have not yet figured out an economical (preferably base R, one-liner) command to then produce combined counts.
aaa<-table(MyData1$MyVar)
bbb<-table(MyData2$MyVar)
> aaa
Dogs 3
Cats 4
Horses 1
Sheep 2
Giraffes 3
> bbb
Dogs 27
Cats 1
Sheep 2
Ocelots 1
Desired Output:
Dogs 30
Cats 5
Horses 1
Sheep 4
Giraffes 3
Ocelots 1
I am sure there is a straightforward base R way to do this; I am just not seeing it.
Base package:
aggregate(V2 ~ V1, data = rbind(df1, df2), FUN = sum)
dplyr:
library(dplyr)
rbind(df1, df2) %>% group_by(V1) %>% summarise(V2 = sum(V2))
Output:
V1 V2
1 Cats 5
2 Dogs 30
3 Giraffes 3
4 Horses 1
5 Sheep 4
6 Ocelots 1
Data:
df1 <- structure(list(V1 = structure(c(2L, 1L, 4L, 5L, 3L), .Label = c("Cats",
"Dogs", "Giraffes", "Horses", "Sheep"), class = "factor"), V2 = c(3L,
4L, 1L, 2L, 3L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(V1 = structure(c(2L, 1L, 4L, 3L), .Label = c("Cats",
"Dogs", "Ocelots", "Sheep"), class = "factor"), V2 = c(27L, 1L,
2L, 1L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
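The answer above starts from data frames, while the question starts from table objects. A minimal sketch of bridging the two, assuming the counts come from one-way tables (the `rep()` calls are hypothetical stand-ins for the question's data, and `Var1`/`Freq` are the default column names that `as.data.frame()` gives an unnamed one-way table):

```r
# Hypothetical stand-ins for the question's two count tables
aaa <- table(rep(c("Dogs", "Cats", "Horses", "Sheep", "Giraffes"),
                 times = c(3, 4, 1, 2, 3)))
bbb <- table(rep(c("Dogs", "Cats", "Sheep", "Ocelots"),
                 times = c(27, 1, 2, 1)))

# Turn each table into a long data frame (columns Var1 and Freq) and stack them
counts_long <- rbind(
  as.data.frame(aaa, stringsAsFactors = FALSE),
  as.data.frame(bbb, stringsAsFactors = FALSE)
)

# Sum the counts per category
combined <- aggregate(Freq ~ Var1, data = counts_long, FUN = sum)
combined
```

The rows come back sorted by category name rather than in the original table order.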
First merge/concatenate your input, then apply table to it.
table(c(MyData1$MyVar, MyData2$MyVar))
You may run into issues if MyVar is a factor and its levels are different in MyData1 and MyData2. In this case, just look up how to merge factor variables.
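One way to sidestep the factor-level issue, sketched here under the assumption that MyVar is a factor in both tables (the data frames below are hypothetical stand-ins), is to convert to character before concatenating. In R versions before 4.1.0, `c()` on factors returned the underlying integer codes, so the conversion is the safer habit:

```r
# Hypothetical stand-ins for MyData1 and MyData2, with MyVar stored as a factor
MyData1 <- data.frame(MyVar = factor(rep(c("Dogs", "Cats", "Horses", "Sheep", "Giraffes"),
                                         times = c(3, 4, 1, 2, 3))))
MyData2 <- data.frame(MyVar = factor(rep(c("Dogs", "Cats", "Sheep", "Ocelots"),
                                         times = c(27, 1, 2, 1))))

# Convert to character so the labels (not the integer codes) are concatenated
combined <- table(c(as.character(MyData1$MyVar), as.character(MyData2$MyVar)))
combined
```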
EDIT: if that doesn't suit your need, I suggest the following:
Merge the levels of all "MyVar" throughout all your "MyDatai" tables (from your example, I assume that it makes sense to do this).
total_levels <- unique(c(levels(MyData1$MyVar), levels(MyData2$MyVar)))
MyData1$MyVar <- factor(MyData1$MyVar, levels=total_levels)
MyData2$MyVar <- factor(MyData2$MyVar, levels=total_levels)
Obviously you will need to wrap this into an apply-like function if you have around 100 data.frames.
Note that this is a one-time preprocessing operation, so I think it's ok if it is a bit costly. Ideally you can integrate it upstream when you generate/load the data.
At this point, all your "MyVar" have the same levels (but are still the same in terms of content of course). Now the good thing is, since table works with the levels, all your tables will have the same entries:
aaa<-table(MyData1$MyVar)
bbb<-table(MyData2$MyVar)
> aaa
Dogs 3
Cats 4
Horses 1
Sheep 2
Giraffes 3
Ocelots 0
> bbb
Dogs 27
Cats 1
Horses 0
Sheep 2
Giraffes 0
Ocelots 1
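And you can just sum them with aaa+bbb, or, if you have a lot, by keeping the tables in a list and calling Reduce with `+` on it (plain sum() would collapse everything into a single grand total). Addition of vectors is lightning fast :)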
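The whole procedure above can be sketched for an arbitrary number of data frames. This is only an illustration: `all_data` is a hypothetical list holding the data frames, and every element is assumed to have a factor column called MyVar.

```r
# Hypothetical list of data frames, each with a factor column MyVar
all_data <- list(
  data.frame(MyVar = factor(rep(c("Dogs", "Cats", "Horses", "Sheep", "Giraffes"),
                                times = c(3, 4, 1, 2, 3)))),
  data.frame(MyVar = factor(rep(c("Dogs", "Cats", "Sheep", "Ocelots"),
                                times = c(27, 1, 2, 1))))
)

# Union of the levels across all data frames
total_levels <- unique(unlist(lapply(all_data, function(d) levels(d$MyVar))))

# One table per data frame, all sharing the same set of entries
tabs <- lapply(all_data, function(d) table(factor(d$MyVar, levels = total_levels)))

# Element-wise sum of all tables
combined <- Reduce(`+`, tabs)
combined
```

Because every table has the same entries in the same order, the element-wise addition lines up category by category.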
Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We may drop the last column from one copy of the data and the first column from another, subtract the two, and use rowSums on the absolute values in base R. This could be very efficient compared to a package solution
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
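To see why the one-liner works, it may help to look at the two shifted copies separately. The following is a step-by-step sketch of the same idea; the intermediate names `left`, `right` and `diffs` are introduced here for illustration only.

```r
df1 <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                  C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

left  <- df1[-ncol(df1)]  # columns A, B, C
right <- df1[-1]          # columns B, C, D

# Positional subtraction pairs A with B, B with C, and C with D
diffs <- abs(left - right)

# Row-wise total of the absolute neighbour differences
sum_diff <- rowSums(diffs)
sum_diff
```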
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is complicated. The idea is to generate a list of the combinations. I tried it with combn, but then I get all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling after that:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~{df <- tibble(a=data[[.[[1]]]]-data[[.[[2]]]])
names(df) <- paste0(.[[1]],"_v_",.[[2]])
df}) %>%
transmute(sum_diff = rowSums(abs(.))) %>%
bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
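The hand-written list of adjacent column pairs could also be generated programmatically. A small base R sketch, assuming the columns of interest are simply all columns in their existing order (`cols` is a name introduced here):

```r
cols <- c("A", "B", "C", "D")  # e.g. names(data)

# Pair each column with its right-hand neighbour: (A,B), (B,C), (C,D)
combinations <- unname(Map(c, head(cols, -1), tail(cols, -1)))
combinations
```

Unlike combn, this yields only neighbouring pairs, which is what the sum of consecutive differences needs.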
Here is a dplyr version of @akrun's elegant approach that calculates the difference between the data frame and its shifted variant:
df %>%
mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
- identity(.) %>% select(2:last_col()))))
And here we have the rowwise variant, which basically follows the same idea, but this time every row is used as a vector from which its shifted self is subtracted.
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())
I have some data that I am trying to group by consecutive values in R. This solution is similar to what I am looking for, however my data is structured like this:
line_num
1
2
3
1
2
1
2
3
4
What I want to do is group each time the number returns to 1 such that I get groups like this:
line_num group_num
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
Any ideas on the best way to accomplish this using dplyr or base R?
Thanks!
We could use cumsum on a logical vector
library(dplyr)
df2 <- df1 %>%
mutate(group_num = cumsum(line_num == 1))
or with base R
df1$group_num <- cumsum(df1$line_num == 1)
data
df1 <- structure(list(line_num = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L
)), class = "data.frame", row.names = c(NA, -9L))
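The cumsum approach above works because a logical vector is treated as 0/1 when summed, so the running total increases by one at every row where line_num is 1. A short sketch of the intermediate steps follows; the `diff()` variant at the end is an alternative not taken from the answer, for the assumed case where a new group should start whenever the value stops increasing rather than exactly at 1.

```r
line_num <- c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L)

# TRUE marks the first row of each group
starts <- line_num == 1

# Running count of group starts gives the group id
group_num <- cumsum(starts)
group_num

# Alternative (assumption): start a new group whenever the value does not increase
group_num_alt <- cumsum(c(TRUE, diff(line_num) <= 0))
```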
So I have a data frame like this one:
First Group Bob
Joe
John
Jesse
Second Group Jane
Mary
Emily
Sarah
Grace
I would like to fill in the empty cells in the first column in the data frame with the last string in that column i.e
First Group Bob
First Group Joe
First Group John
First Group Jesse
Second Group Jane
Second Group Mary
Second Group Emily
Second Group Sarah
Second Group Grace
With tidyr, there is fill() but it obviously doesn't work with strings. Is there an equivalent for strings? If not is there a way to accomplish this?
Seems fill() is designed to be used in isolation. When using fill() inside a mutate() statement this error appears (regardless of the data type), but it works when using it as just a component of the pipe structure. Could that have been the problem?
Just for full clarity, a quick example. Assuming you have a data frame called 'people' with columns 'group' and 'name', the right structure would be:
people %>%
fill(group)
and the following would give the error you described (and a similar error when using numbers):
people %>%
mutate(
group = fill(group)
)
(I made the assumption that this was output from an R console session. If it's a raw text file the data input may need to be done with read.fwf.)
The display suggests those are empty character values in the "spaces".
First set them to NA and then use na.locf from zoo:
dat[dat==""] <- NA
dat[1:2] <- lapply(dat[1:2], zoo::na.locf)
dat
#------------
V1 V2 V3
1 First Group Bob
2 First Group Joe
3 First Group John
4 First Group Jesse
5 Second Group Jane
6 Second Group Mary
7 Second Group Emily
8 Second Group Sara
9 Second Group Grace
To start with what I was using:
dat <-
structure(list(V1 = structure(c(2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L,
1L), .Label = c("", "First", "Second"), class = "factor"), V2 = structure(c(2L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", "Group"), class = "factor"),
V3 = structure(c(1L, 6L, 7L, 5L, 4L, 8L, 2L, 9L, 3L), .Label = c("Bob",
"Emily", "Grace", "Jane", "Jesse", "Joe", "John", "Mary",
"Sara"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
If I have to take a stab at what your data structure is, I might have something like this:
df <- data.frame(c1=c("First Group", "", "", "", "Second Group", "", "", "", ""),
c2=c("Bob","Joe","Jon","Jesse","Jane","Mary","Emily","Sara","Grace"),
stringsAsFactors = FALSE)
Then, a very basic way to do this would be by simply looping:
for(i in 2:nrow(df)) if(df$c1[i]=="") df$c1[i] <- df$c1[i-1]
df
c1 c2
1 First Group Bob
2 First Group Joe
3 First Group Jon
4 First Group Jesse
5 Second Group Jane
6 Second Group Mary
7 Second Group Emily
8 Second Group Sara
9 Second Group Grace
However, I would suggest you accept @42-'s solution if you have anything other than a small data set, as zoo::na.locf is optimized to work with large numbers of records and zoo is a very respected, widely used, stable package.
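If an extra package is not wanted, the loop can also be replaced by a vectorised base R idiom. This is a sketch rather than a drop-in replacement: it assumes the blanks are empty strings and that the first element is not blank.

```r
c1 <- c("First Group", "", "", "", "Second Group", "", "", "", "")

# TRUE where a real value is present
has_value <- c1 != ""

# cumsum() gives, for every row, the index of the most recent non-empty value
filled <- c1[has_value][cumsum(has_value)]
filled
```

With NA instead of "" as the blank marker, the test would become `!is.na(c1)`.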
I have input dataframes Berry and Orange
Berry = structure(list(Name = c("ACT", "ACTION", "ACTIVISM", "ACTS",
"ADDICTION", "ADVANCE"), freq = c(2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
Orange = structure(list(Name = c("ACHIEVE", "ACROSS", "ACT", "ACTION",
"ADVANCE", "ADVANCING"), freq = c(1L, 3L, 1L, 1L, 1L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
Running the following operation will give me the desired output
output = t(merge(Berry,Orange, by = "Name", all = TRUE))
rownames(output) = c("","Berry","Orange")
colnames(output) = output[1,]
output = output[2:3,]
output = data.frame(output)
However, now I have to create output from 72 dataframes similar to Berry and Orange. Since merge appears to work with only two data.frame at a time, I'm not sure what would be the best approach for me. I tried rbind.fill which kept the values but lost the Names. I found this and this but couldn't figure out a solution on my own.
Here is one more data.frame in order to provide a reproducible example
Apple = structure(list(Name = c("ABIDING", "ABLE", "ABROAD", "ACROSS",
"ACT", "ADVANTAGE"), freq = c(1L, 1L, 1L, 4L, 2L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to figure out how to obtain output from Apple, Berry, and Orange. I am looking for a solution that would work for multiple dataframes, preferably without me having to provide the dataframes manually.
You can assume that the data.frame names to be processed for getting the output is available in a list df_names:
df_names = c("Apple","Berry","Orange")
Or, you can also assume that every data.frame in the Global Environment needs to be processed to create output.
If you have all your data frames in an environment, you can get them into a named list then use package reshape2 to reshape the list. If desired, you can then set the first column as the row names.
library(reshape2)
dcast(melt(Filter(is.data.frame, mget(ls()))), L1 ~ Name)
# L1 ABIDING ABLE ABROAD ACHIEVE ACROSS ACT ACTION ACTIVISM ACTS ADDICTION ADVANCE ADVANCING ADVANTAGE
# 1 Apple 1 1 1 NA 4 2 NA NA NA NA NA NA 1
# 2 Berry NA NA NA NA NA 2 2 1 1 1 1 NA NA
# 3 Orange NA NA NA 1 3 1 1 NA NA NA 1 1 NA
Note: This assumes all your data is in the global environment and that no other data frames are present except the ones to be used here.
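If only the data frames named in df_names should be used, a base R sketch along the following lines may be enough. It is an alternative to the reshape2 call, not a reproduction of it: it stacks the data frames with a label column (called `objName` here, an arbitrary choice) and cross-tabulates with xtabs, which fills missing combinations with 0 rather than NA.

```r
Apple  <- data.frame(Name = c("ABIDING", "ABLE", "ABROAD", "ACROSS", "ACT", "ADVANTAGE"),
                     freq = c(1L, 1L, 1L, 4L, 2L, 1L), stringsAsFactors = FALSE)
Berry  <- data.frame(Name = c("ACT", "ACTION", "ACTIVISM", "ACTS", "ADDICTION", "ADVANCE"),
                     freq = c(2L, 2L, 1L, 1L, 1L, 1L), stringsAsFactors = FALSE)
Orange <- data.frame(Name = c("ACHIEVE", "ACROSS", "ACT", "ACTION", "ADVANCE", "ADVANCING"),
                     freq = c(1L, 3L, 1L, 1L, 1L, 1L), stringsAsFactors = FALSE)

df_names <- c("Apple", "Berry", "Orange")

# Fetch the data frames by name into a named list
dfs <- mget(df_names)

# Stack them, tagging each row with the name of its source data frame
long <- do.call(rbind, Map(function(d, nm) cbind(d, objName = nm), dfs, names(dfs)))

# One row per data frame, one column per Name, frequencies as cell values
wide <- xtabs(freq ~ objName + Name, data = long)
wide
```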
We can use tidyverse
library(dplyr)
library(tidyr)
list(Apple = Apple, Orange = Orange, Berry = Berry) %>%
bind_rows(.id = "objName") %>%
spread(Name, freq, fill = 0)
# objName ABIDING ABLE ABROAD ACHIEVE ACROSS ACT ACTION ACTIVISM ACTS ADDICTION ADVANCE ADVANCING ADVANTAGE
#1 Apple 1 1 1 0 4 2 0 0 0 0 0 0 1
#2 Berry 0 0 0 0 0 2 2 1 1 1 1 0 0
#3 Orange 0 0 0 1 3 1 1 0 0 0 1 1 0
As you have 72 data.frames, it is better not to create all these objects in the global environment. Instead, read the dataset files into a list and then do the processing. Suppose the files are all in the working directory:
files <- list.files(pattern = "\\.csv$")
lst <- lapply(files, read.csv, stringsAsFactors=FALSE)
and then do the processing with bind_rows as above. As the file names are not known, we cannot comment on how to create the 'objName'
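One possible way to obtain the labels, purely as an assumption about how the files might be named, is to use each file name without its extension. The sketch below writes two small hypothetical CSV files to a temporary directory so that it is self-contained.

```r
# Hypothetical files in a temporary directory, standing in for the real data
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write.csv(data.frame(Name = c("ACT", "ACTION"), freq = c(2L, 2L)),
          file.path(tmp_dir, "Berry.csv"), row.names = FALSE)
write.csv(data.frame(Name = c("ACHIEVE", "ACROSS"), freq = c(1L, 3L)),
          file.path(tmp_dir, "Orange.csv"), row.names = FALSE)

# Read every CSV file into one list
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
dfs <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Name the list elements after the files (assumption: file name = object name)
names(dfs) <- tools::file_path_sans_ext(basename(files))
names(dfs)
```

A named list like this can be passed straight to bind_rows with .id = "objName".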
I have a data.frame 'data', where one column contains integer values between 1:100, which are coded values for the Isolate they represent.
Here's my example data, 'data':
Size Isolate spin
1 primary 3 up
2 primary 4 down
3 sec 6 strange
4 ter 1 charm
5 sec 3 bottom
6 quart 2 top
I have another data.frame that contains the key between the integers and the name of the Isolate
1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf
This list is 100 Isolates in length, too much to type in by hand with if/else.
I'd like to know an easy solution for replacing the integers in my first data.frame, which aren't in ascending order as you can see, with the corresponding Isolate names in the second data.frame.
I tried, after researching:
data$Isolate <- as.numeric(factor(data$Isolate,
levels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf")
)
)
but this just replaced the Isolate column with N/A.
As Hubert said in the comments, this is a simple use-case for merge.
Let's say the column names of your second "key" data frame are "Isolate" and "Isolate_Name", then it's as easy as
merge(data, key_data, by = "Isolate")
The default is for an "inner join" which will only keep records that have matches. If you're worried about losing records that don't have matches you can add the argument all.x = TRUE.
If you prefer non-base packages, this is easy in data.table or dplyr as well.
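A small sketch of the merge described above, using the question's example values. The name `key_data` and the column names follow this answer's assumption. Note that merge sorts the result by the key column; the `match()` line at the end is an alternative, not part of the answer, that keeps the original row order.

```r
data <- data.frame(Size = c("primary", "primary", "sec", "ter", "sec", "quart"),
                   Isolate = c(3L, 4L, 6L, 1L, 3L, 2L),
                   spin = c("up", "down", "strange", "charm", "bottom", "top"),
                   stringsAsFactors = FALSE)
key_data <- data.frame(Isolate = 1:7,
                       Isolate_Name = c("alpha", "bravo", "charlie", "delta",
                                        "echo", "foxtrot", "golf"),
                       stringsAsFactors = FALSE)

# Left join: keep every row of data, even if a code has no match in key_data
merged <- merge(data, key_data, by = "Isolate", all.x = TRUE)
merged

# Alternative that preserves the original row order
data$Isolate_Name <- key_data$Isolate_Name[match(data$Isolate, key_data$Isolate)]
```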
Using factor, you could try:
data$Isolate <- factor(data$Isolate,
levels=1:7,
labels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf"))
If you have many levels that are already in their own data.frame, you could automate this.
data$Isolate <- factor(data$Isolate,levels=code$No,labels=code$Value)
With your second data.frame, code:
code <- read.table(text="1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf",stringsAsFactors=FALSE)
names(code) <- c("No","Value")
df$Isolate <- df2[,1][df$Isolate]
# Size Isolate spin
# 1 primary charlie up
# 2 primary delta down
# 3 sec foxtrot strange
# 4 ter alpha charm
# 5 sec charlie bottom
# 6 quart bravo top
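You can index the lookup column by the integer codes in the target data frame. This works because each code matches the row position of its name in the lookup data frame.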
Data
df <- structure(list(Size = structure(c(1L, 1L, 3L, 4L, 3L, 2L), .Label = c("primary",
"quart", "sec", "ter"), class = "factor"), Isolate = c(3L, 4L,
6L, 1L, 3L, 2L), spin = structure(c(6L, 3L, 4L, 2L, 1L, 5L), .Label = c("bottom",
"charm", "down", "strange", "top", "up"), class = "factor")), .Names = c("Size",
"Isolate", "spin"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(V2 = structure(1:7, .Label = c("alpha", "bravo",
"charlie", "delta", "echo", "foxtrot", "golf"), class = "factor")), .Names = "V2", class = "data.frame", row.names = c(NA,
-7L))