Combine several data.frames into one (keeping all row names) - r

I have two input data frames, Berry and Orange:
Berry = structure(list(Name = c("ACT", "ACTION", "ACTIVISM", "ACTS",
"ADDICTION", "ADVANCE"), freq = c(2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
Orange = structure(list(Name = c("ACHIEVE", "ACROSS", "ACT", "ACTION",
"ADVANCE", "ADVANCING"), freq = c(1L, 3L, 1L, 1L, 1L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
Running the following operation will give me the desired output
output = t(merge(Berry,Orange, by = "Name", all = TRUE))
rownames(output) = c("","Berry","Orange")
colnames(output) = output[1,]
output = output[2:3,]
output = data.frame(output)
However, now I have to create output from 72 data frames similar to Berry and Orange. Since merge appears to work with only two data.frames at a time, I'm not sure what the best approach would be. I tried rbind.fill, which kept the values but lost the Names. I found this and this but couldn't figure out a solution on my own.
Here is one more data.frame in order to provide a reproducible example
Apple = structure(list(Name = c("ABIDING", "ABLE", "ABROAD", "ACROSS",
"ACT", "ADVANTAGE"), freq = c(1L, 1L, 1L, 4L, 2L, 1L)), .Names = c("Name",
"freq"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to figure out how to obtain output from Apple, Berry, and Orange. I am looking for a solution that would work for multiple data frames, preferably without me having to provide the data frames manually.
You can assume that the names of the data.frames to be processed are available in a character vector df_names:
df_names = c("Apple","Berry","Orange")
Or, you can also assume that every data.frame in the Global Environment needs to be processed to create output.

If you have all your data frames in an environment, you can get them into a named list and then use package reshape2 to reshape the list. If desired, you can then set the first column as the row names (a sketch of this is shown below).
library(reshape2)
dcast(melt(Filter(is.data.frame, mget(ls()))), L1 ~ Name)
# L1 ABIDING ABLE ABROAD ACHIEVE ACROSS ACT ACTION ACTIVISM ACTS ADDICTION ADVANCE ADVANCING ADVANTAGE
# 1 Apple 1 1 1 NA 4 2 NA NA NA NA NA NA 1
# 2 Berry NA NA NA NA NA 2 2 1 1 1 1 NA NA
# 3 Orange NA NA NA 1 3 1 1 NA NA NA 1 1 NA
Note: This assumes all your data is in the global environment and that no other data frames are present except the ones to be used here.
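For instance, restricting the reshape to the objects listed in df_names (rather than everything ls() finds) and then moving the L1 column into the row names might look like this; a sketch, assuming Apple, Berry, and Orange exist in the global environment:
library(reshape2)
# use only the data frames named in df_names from the question
df_names <- c("Apple", "Berry", "Orange")
out <- dcast(melt(mget(df_names)), L1 ~ Name)
# move the object names into the row names, as mentioned above
rownames(out) <- out$L1
out$L1 <- NULL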

We can use tidyverse
library(dplyr)
library(tidyr)
list(Apple = Apple, Orange = Orange, Berry = Berry) %>%
  bind_rows(.id = "objName") %>%
  spread(Name, freq, fill = 0)
# objName ABIDING ABLE ABROAD ACHIEVE ACROSS ACT ACTION ACTIVISM ACTS ADDICTION ADVANCE ADVANCING ADVANTAGE
#1 Apple 1 1 1 0 4 2 0 0 0 0 0 0 1
#2 Berry 0 0 0 0 0 2 2 1 1 1 1 0 0
#3 Orange 0 0 0 1 3 1 1 0 0 0 1 1 0
As you have 72 data.frames, it is better not to create all these objects in the global environment. Instead, read the dataset files into a list and then do the processing. For example, if the files are all in the working directory:
files <- list.files(pattern = "\\.csv$")
lst <- lapply(files, read.csv, stringsAsFactors = FALSE)
and then do the processing with bind_rows as above. As the file naming scheme is not specified, we cannot say exactly how to create the 'objName'; one possibility is sketched below.
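Assuming each object should simply be named after its file (minus the .csv extension), this could look like:
# assumption: the desired object name is the file name without the extension
names(lst) <- tools::file_path_sans_ext(basename(files))
library(dplyr)
library(tidyr)
lst %>%
  bind_rows(.id = "objName") %>%
  spread(Name, freq, fill = 0)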

Related

group ordered line numbers

I have some data that I am trying to group by consecutive values in R. This solution is similar to what I am looking for; however, my data is structured like this:
line_num
1
2
3
1
2
1
2
3
4
What I want to do is group each time the number returns to 1 such that I get groups like this:
line_num group_num
1        1
2        1
3        1
1        2
2        2
1        3
2        3
3        3
4        3
Any ideas on the best way to accomplish this using dplyr or base R?
Thanks!
We could use cumsum on a logical vector
library(dplyr)
df2 <- df1 %>%
  mutate(group_num = cumsum(line_num == 1))
or with base R
df1$group_num <- cumsum(df1$line_num == 1)
data
df1 <- structure(list(line_num = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L
)), class = "data.frame", row.names = c(NA, -9L))
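For reference, either version reproduces the grouping asked for in the question:
#   line_num group_num
# 1        1         1
# 2        2         1
# 3        3         1
# 4        1         2
# 5        2         2
# 6        1         3
# 7        2         3
# 8        3         3
# 9        4         3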

How do I apply multiple conditions to create new variable from current variables in R?

I am trying to apply multiple conditions to current variables to create a new variable. I am looking at 5 variables/columns and want to apply the following conditions: if at least 3 out of the 5 variables are >0, take the mean of only the variables that are >0; if fewer than 3 variables are >0, return NA.
For example,
ID  Work  School  School 2  Work 1  Work 2  New Variable
1   5     -2      -1        -7      2       NA
2   3     5       -1        2       3       3.25
3   4     5       1         3       3       3.2
4   -1    2       -7        2       4       2.67
df[["New Variable"]] <- apply(df[, -1], 1, function(row) {
if (sum(row > 0) >= 3) mean(row[row > 0]) else NA
})
df[, -1] is used to ignore the first column (ID).
Where df is:
df <- read.table(text =
'ID Work School "School 2" "Work 1" "Work 2"
1 5 -2 -1 -7 2
2 3 5 -1 2 3
3 4 5 1 3 3
4 -1 2 -7 2 4
', header = TRUE, check.names = FALSE)
A vectorized option: create a logical matrix (i1), count the positive values per row with rowSums, and use ifelse to return the mean of the positive values only for rows with at least 3 of them, by replacing the other elements with NA and making use of the na.rm argument of rowMeans:
i1 <- df[-1] > 0
df$New <- ifelse(rowSums(i1) >= 3,
                 rowMeans(replace(df[-1], !i1, NA), na.rm = TRUE),
                 NA)
data
df <- structure(list(ID = 1:4, Work = c(5L, 3L, 4L, -1L), School = c(-2L,
5L, 5L, 2L), `School 2` = c(-1L, -1L, 1L, -7L), `Work 1` = c(-7L,
2L, 3L, 2L), `Work 2` = c(2L, 3L, 3L, 4L)), class = "data.frame",
row.names = c(NA,
-4L))
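For the example data, both approaches reproduce the 'New Variable' column from the question (the 2.67 there is just 2.666667 rounded):
# df$New (or df[["New Variable"]] from the apply() version)
# [1]       NA 3.250000 3.200000 2.666667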

Add unique rows to dataframe (opposite of intersect)

I am relatively new to R but slowly finding my way. I encountered a problem, however, and hope someone can help me.
Let's say I have two data frames (let's call them A and B), both containing survey responses. A contains all responses from the first set of people. B contains the responses of the second set of people, plus the people of the first set but with their responses set to NA. An example:
Dataframe A:
Household Individual Answer_A Answer_b
1 2 5 6
1 3 6 6
2 1 2 3
Dataframe B:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 NA NA
1 3 NA NA
2 1 NA NA
2 2 4 7
I want to get one dataframe with all individuals and their responses:
Dataframe C:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 5 6
1 3 6 6
2 1 2 3
2 2 4 7
If I only have two datasets I can use rbind.fill, with rbind.fill(B, A) to get dataframe C, as then the NAs in B are overwritten with answers in A.
But... if I would have to add a third dataset, D, that would consist of NAs for people in A and B, I would not be able to use this solution. What would I be able to do then? I've looked at intersect, outersect, different forms of join, but can't seem to think of a good solution.
Any thoughts?
Maybe you can left_join and then use coalesce
library(dplyr)
left_join(B, A, by = c("Household", "Individual")) %>%
  mutate(Answer_A = coalesce(Answer_A.x, Answer_A.y),
         Answer_B = coalesce(Answer_b.x, Answer_b.y)) %>%
  select(-matches("\\.x|\\.y"))
# Household Individual Answer_A Answer_B
#1 1 1 3 6
#2 1 2 5 6
#3 1 3 6 6
#4 2 1 2 3
#5 2 2 4 7
data
A <- structure(list(Household = c(1L, 1L, 2L), Individual = c(2L,
3L, 1L), Answer_A = c(5L, 6L, 2L), Answer_b = c(6L, 6L, 3L)), class = "data.frame",
row.names = c(NA, -3L))
B <- structure(list(Household = c(1L, 1L, 1L, 2L, 2L), Individual = c(1L,
2L, 3L, 1L, 2L), Answer_A = c(3L, NA, NA, NA, 4L), Answer_b = c(6L,
NA, NA, NA, 7L)), class = "data.frame", row.names = c(NA, -5L))
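If more data frames with the same structure have to be folded in (the third dataset D mentioned in the question, for instance), one possible generalisation is to reduce a coalescing join over a list. A sketch, assuming every data frame has the same Household/Individual keys and the same answer columns:
library(dplyr)
library(purrr)

# a hypothetical third data frame D would simply be another list element
dfs <- list(B, A)

coalesce_join <- function(x, y) {
  full_join(x, y, by = c("Household", "Individual")) %>%
    mutate(Answer_A = coalesce(Answer_A.x, Answer_A.y),
           Answer_b = coalesce(Answer_b.x, Answer_b.y)) %>%
    select(Household, Individual, Answer_A, Answer_b)
}

reduce(dfs, coalesce_join) %>%
  arrange(Household, Individual)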

Replacing values with 'NA' by ID in R

I have data that looks like this
ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1
I want to replace all values with 'NA' if the ID occurs more than once in the dataframe. The final product should look like this
ID v1 v2
1 1 0
2 0 1
3 NA NA
3 NA NA
4 0 1
I could do this by hand, but I want R to detect all the duplicate cases (in this case two times ID '3') and replace the values with 'NA'.
Thanks for your help!
You could use duplicated() from either end, and then replace.
idx <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)
df[idx, -1] <- NA
which gives
ID v1 v2
1 1 1 0
2 2 0 1
3 3 NA NA
4 3 NA NA
5 4 0 1
This will also work if the duplicated IDs are not next to each other.
Data:
df <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
One more option:
df1[df1$ID %in% df1$ID[duplicated(df1$ID)], -1] <- NA
#> df1
# ID v1 v2
#1 1 1 0
#2 2 0 1
#3 3 NA NA
#4 3 NA NA
#5 4 0 1
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L,
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1",
"v2"), class = "data.frame", row.names = c(NA, -5L))
Here is a base R method
# get list of repeated IDs
repeats <- rle(df$ID)$values[rle(df$ID)$lengths > 1]
# set the corresponding variables to NA
df[, -1] <- sapply(df[, -1], function(i) {i[df$ID %in% repeats] <- NA; i})
In the first line, we use rle to extract the repeated IDs. In the second, we use sapply to loop through the non-ID variables and, in each one, replace the values of rows whose ID repeats with NA.
Note that this assumes that the data set is sorted by ID. This may be accomplished with the order function. (df <- df[order(df$ID),]).
If the dataset is very large, you might break up the first function into two steps to avoid computing the rle twice:
dfRle <- rle(df$ID)
repeats <- dfRle$values[dfRle$lengths > 1]
data
df <- read.table(header=T, text="ID v1 v2
1 1 0
2 0 1
3 1 0
3 0 1
4 0 1")
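For completeness, another base R option is ave, which counts the rows per ID and therefore does not require the data to be sorted (a sketch using the same df as above):
# rows whose ID occurs more than once get NA in every non-ID column
dup <- ave(seq_along(df$ID), df$ID, FUN = length) > 1
df[dup, -1] <- NA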

Create merged tables in R

I have some data in multiple large data tables in R. I wish to merge and produce counts of various variables.
I can produce the counts within individual tables easily using the 'table' command, but I have not yet figured out an economical (preferably base R, one-liner) command to then produce combined counts.
aaa<-table(MyData1$MyVar)
bbb<-table(MyData2$MyVar)
> aaa
Dogs 3
Cats 4
Horses 1
Sheep 2
Giraffes 3
> bbb
Dogs 27
Cats 1
Sheep 2
Ocelots 1
Desired Output:
Dogs 30
Cats 5
Horses 1
Sheep 4
Giraffes 3
Ocelots 1
I am sure there is a straightforward Base R way to do this I am just not seeing it.
Base package:
aggregate(V2 ~ V1, data = rbind(df1, df2), FUN = sum)
dplyr:
library(dplyr)
rbind(df1, df2) %>% group_by(V1) %>% summarise(V2 = sum(V2))
Output:
V1 V2
1 Cats 5
2 Dogs 30
3 Giraffes 3
4 Horses 1
5 Sheep 4
6 Ocelots 1
Data:
df1 <- structure(list(V1 = structure(c(2L, 1L, 4L, 5L, 3L), .Label = c("Cats",
"Dogs", "Giraffes", "Horses", "Sheep"), class = "factor"), V2 = c(3L,
4L, 1L, 2L, 3L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(V1 = structure(c(2L, 1L, 4L, 3L), .Label = c("Cats",
"Dogs", "Ocelots", "Sheep"), class = "factor"), V2 = c(27L, 1L,
2L, 1L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
First merge/concatenate your input, then apply table to it.
table(c(MyData1$MyVar, MyData2$MyVar))
You may run into issues if MyVar is a factor and its levels are different in MyData1 and MyData2. In this case, just look up how to merge factor variables.
EDIT: if that doesn't suit your needs, I suggest the following:
Merge the levels of all "MyVar" throughout all your "MyDatai" tables (from your example, I assume that it makes sense to do this).
total_levels <- unique(c(levels(MyData1$MyVar), levels(MyData2$MyVar)))
MyData1$MyVar <- factor(MyData1$MyVar, levels=total_levels)
MyData2$MyVar <- factor(MyData2$MyVar, levels=total_levels)
Obviously you will need to wrap this in an apply-like function if you have around 100 data.frames (see the sketch at the end of this answer).
Note that this is a one-time preprocessing operation, so I think it's ok if it is a bit costly. Ideally you can integrate it upstream when you generate/load the data.
At this point, all your "MyVar" have the same levels (though their content is unchanged, of course). Now the good thing is, since table works with the levels, all your tables will have the same entries:
aaa<-table(MyData1$MyVar)
bbb<-table(MyData2$MyVar)
> aaa
Dogs 3
Cats 4
Horses 1
Sheep 2
Giraffes 3
Ocelots 0
> bbb
Dogs 27
Cats 1
Horses 0
Sheep 2
Giraffes 0
Ocelots 1
And you can just sum them with aaa + bbb, or with Reduce if you have a lot (sketched below). Addition of vectors is lightning fast :)
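A sketch of how the whole thing might look for many tables, assuming the data frames are collected in a list (the list my_data and its construction are illustrative, and MyVar is assumed to be a factor in each of them):
# illustrative list of the input data frames
my_data <- list(MyData1, MyData2)

# merge the factor levels across all tables
total_levels <- unique(unlist(lapply(my_data, function(d) levels(d$MyVar))))
my_data <- lapply(my_data, function(d) {
  d$MyVar <- factor(d$MyVar, levels = total_levels)
  d
})

# tabulate each data frame and add the identically shaped tables together
Reduce(`+`, lapply(my_data, function(d) table(d$MyVar)))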
