Sample data:
A = data.frame(id = 1:10)
B = data.frame(id = 7:16)
C = data.frame(id = -10:-1)
mylist = list(A = A, B = B, C = C)
What I want is to combine these three data frames into a single one:
WANT = data.frame(id = c(1:10, 7:16, -10:-1),
                  dataID = c(rep("A", 10), rep("B", 10), rep("C", 10)))
Suppose I have a list that contains a bunch of data frames (this is how I am given the data). I want to stack them into one really big, tall data frame like WANT, using the names of the data sets in the list for dataID. I am able to do this by hand for just a few, for example A, B, C, but I have about a hundred, and I am wondering how to pull the data frames out of the list and make a tall file like the WANT example.
You can add the dataID to each individual data frame and then bind them together.
EDIT: After some clarification, here is a new approach.
library(tidyverse)

listNAMES = LETTERS[1:3]
tibble(dataID = listNAMES,
       mydata = list(A, B, C)) %>%
  unnest(mydata)
# A tibble: 30 x 2
   dataID    id
   <chr>  <int>
 1 A          1
 2 A          2
 3 A          3
 4 A          4
 5 A          5
 6 A          6
 7 A          7
 8 A          8
 9 A          9
10 A         10
# ... with 20 more rows
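Alternatively, since your list can carry the names directly, dplyr::bind_rows() does the whole job in one call via its .id argument. A minimal sketch, assuming mylist is the named list built above:

library(dplyr)

# .id = "dataID" stores each list element's name in a dataID column
WANT <- bind_rows(mylist, .id = "dataID")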
I am doing social network analysis and working with two data frames. Data frame A (or "nodes") has the information related to each node of the network (i.e. id and name). Data frame B (or "links") has two columns, "from" and "to", which show how the nodes are connected to each other. Each row represents a link "from" one node "to" another.
I want to use the package networkD3 to visualize the network, but it has some requirements: ids must start from zero and be consecutive (0, 1, 2, etc.). Because my nodes and links are a random subset from a larger database, they are not consecutive.
I sorted the "nodes" data frame based on the id and created a new column (new_id) starting from zero and with consecutive numbers. But now, I don't know how to update the "links" data frame based on the new_id's.
Currently, I am converting the values in the "links" data frame to characters and then revaluing them using the plyr package, but I need something that works for a larger dataset.
Here is a sample of the two data frames that I have now:
set.seed(10)
nodes_df <- data.frame(id = c(1, 3, 5, 6, 8, 10),
                       name = c("Agriculture", "Agriculture_in_Mesoamerica",
                                "Agriculture_in_ancient_Greece", "Agriculture_in_ancient_Rome",
                                "Agriculture_in_India", "Agriculture_in_China"),
                       new_id = seq(0, 5))
links_df <- data.frame(from = c(3, 3, 5, 6, 8, 10),
                       to = c(1, 5, 6, 8, 10, 3))
In summary, I need to update the values in the links_df to correspond to the new_id values from the nodes_df.
In base R you just need to use merge() and extract the required column. One caveat: merge() does not preserve the original row order (it sorts by the join column), so keep a row index and restore the order before assigning back:
links_df$row <- seq_len(nrow(links_df))  # remember the original order

m_to <- merge(links_df, nodes_df, by.x = "to", by.y = "id", all.x = TRUE)
links_df$new_to <- m_to[order(m_to$row), "new_id"]

m_from <- merge(links_df, nodes_df, by.x = "from", by.y = "id", all.x = TRUE)
links_df$new_from <- m_from[order(m_from$row), "new_id"]

links_df$row <- NULL
links_df <- links_df[, c("from", "to", "new_from", "new_to")]  # reorder columns
links_df
  from to new_from new_to
1    3  1        1      0
2    3  5        1      2
3    5  6        2      3
4    6  8        3      4
5    8 10        4      5
6   10  3        5      1
An alternative to merging or joining is to use recode(). A tidyverse-based solution could look as follows.
library(dplyr)
library(tibble)

# named vector: names are the old ids, values are the new ids
swap <- deframe(tibble(id = nodes_df$id, new_id = nodes_df$new_id))

links_df %>%
  mutate(new_from = recode(from, !!!swap),
         new_to = recode(to, !!!swap))
#   from to new_from new_to
# 1    3  1        1      0
# 2    3  5        1      2
# 3    5  6        2      3
# 4    6  8        3      4
# 5    8 10        4      5
# 6   10  3        5      1
Technically speaking, networkD3 expects the values in the links data frame to be the (zero-based) index of the nodes they refer to in the nodes data frame. So the first row/node in the nodes data frame is 0, and so forth.
You can use match() to determine the 1-based index of each element of one vector within a target vector, then subtract 1 to get a 0-based index.
links_df$from
#> [1] 3 3 5 6 8 10
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$from, nodes_df$id) - 1
#> [1] 1 1 2 3 4 5
links_df$to
#> [1] 1 5 6 8 10 3
nodes_df$id
#> [1] 1 3 5 6 8 10
match(links_df$to, nodes_df$id) - 1
#> [1] 0 2 3 4 5 1
Created on 2021-03-28 by the reprex package (v1.0.0)
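Putting it together, a minimal sketch of the zero-based links frame networkD3 expects (links_d3 is just a hypothetical name for the frame you would hand to the plotting function):

links_d3 <- data.frame(from = match(links_df$from, nodes_df$id) - 1,
                       to   = match(links_df$to,   nodes_df$id) - 1)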
I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses, I need an identical number of rows for each speaker.
So what I want to do is compare each dyad_id in both data frames and then shorten the longer one by deleting its last row(s) across all columns.
I prepared a data frame to illustrate what I already have.
So far, I tried splitting both data frames by dyad_id, in order to compare the splits one after another and delete the unnecessary rows. As I have many conversations, I need to automate this across all dyad_ids.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
fw_quantiles_a <- c(4, 3, 1, 2, 3, 2, 4, 1, 4, 5, 6, 7)
df_A <- data.frame(dyad_id_A, fw_quantiles_a)

dyad_id_B <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
fw_quantiles_b <- c(3, 1, 2, 1, 2, 4, 1, 3, 3, 4, 5)
df_B <- data.frame(dyad_id_B, fw_quantiles_b)
Example for the final dataset:
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
#    dyad_id_A time fw_quantiles_a fw_quantiles_b
# 1          1    1              4              3
# 2          1    2              3              1
# 3          1    3              1              2
# 4          2    1              2              2
# 5          2    2              3              4
# 6          2    3              2              1
# 7          3    1              1              3
# 8          3    2              4              3
# 9          3    3              5              4
# 10         3    4              6              5
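The same counter-plus-join idea works in dplyr if you prefer the tidyverse. A sketch, where add_time is a hypothetical helper (assumes dplyr >= 1.0 for the .data pronoun):

library(dplyr)

# number the talk-turns within each dyad, then join on (dyad, time)
add_time <- function(df, id) {
  df %>% group_by(.data[[id]]) %>% mutate(time = row_number()) %>% ungroup()
}
inner_join(add_time(df_A, "dyad_id_A"),
           add_time(df_B, "dyad_id_B"),
           by = c("dyad_id_A" = "dyad_id_B", "time"))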
Maybe we can try using table to calculate the frequencies of the ids in each data frame (assuming the same ids appear in both). Take the element-wise minimum with pmin and repeat the names according to those frequencies:
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
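To actually cut the data frames down (not just build the combined id vector), one sketch that reuses tab; trim_to is a hypothetical helper:

# keep only the first tab[[i]] rows of each dyad
trim_to <- function(df, ids, tab) do.call(rbind, Map(head, split(df, ids), tab))
df_A_trimmed <- trim_to(df_A, df_A$dyad_id_A, tab)
df_B_trimmed <- trim_to(df_B, df_B$dyad_id_B, tab)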
I did find a thread on this (R equivalent of .first or .last sas operator) but it did not fully answer my question.
I come from a SAS background and a common operation is, for example, when you have your patient ID with several different values, and you want to keep only the row with the minimum/maximum value for another variable for each ID. For example, I might have data with dates of a certain medical problem for each ID, and I want a dataset with just the first/last problem date for each patient.
Here's a simple example that gets me what I want, but I want to know if there's a better way to do it. I sort by id and then by count, and I want to keep just the row with the largest count for each id.
testdata <- data.frame(id = c(1, 1, 1, 2, 3, 3, 4, 3, 4, 4, 4),
                       count = c(5, 9, 2, 6, 16, 12, 0, 11, 8, 8, 7))
library(dplyr)
testdata2<-arrange(testdata,id,count)
testdata3<-cbind(testdata2,!duplicated(testdata2$id,fromLast=TRUE))
testdata4<-subset(testdata3,testdata3[,3]=='TRUE')[,-3]
> testdata4
   id count
3   1     9
4   2     6
7   3    16
11  4     8
Is there a more compact way to do this?
Thank you.
do.call(rbind.data.frame,
        c(by(testdata, testdata$id, function(d) d[c(1L, nrow(d)), ]),
          stringsAsFactors = FALSE))
#       id count
# 1.1    1     5
# 1.3    1     2
# 2.4    2     6
# 2.4.1  2     6
# 3.5    3    16
# 3.8    3    11
# 4.7    4     0
# 4.11   4     7
Breaking it down:
d[c(1L, nrow(d)), ] returns the first and last row of a data frame. (I'm assuming the frame has already been ordered appropriately.)
by(testdata, testdata$id, function(d) ...) breaks the larger frame into smaller frames by $id and passes each smaller frame to the anonymous function, returning a by-list of the results.
do.call(rbind.data.frame, ...) takes that list and row-binds it back into a single frame. Since the default is to create factors, I added stringsAsFactors = FALSE.
If you want to use dplyr, you can do:
library(dplyr)
group_by(testdata, id) %>%
  slice(c(1, n())) %>%
  ungroup()
# # A tibble: 8 × 2
#      id count
#   <dbl> <dbl>
# 1     1     5
# 2     1     2
# 3     2     6
# 4     2     6
# 5     3    16
# 6     3    11
# 7     4     0
# 8     4     7
where n() is a special function within dplyr pipes that returns the number of rows in that (optionally-grouped) frame.
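Both answers above keep the first and the last row per group (the SAS first./last. behaviour). If, as in your testdata4, you only want the single row with the largest count per id, newer dplyr (>= 1.0.0) has slice_max() — a sketch:

library(dplyr)

testdata %>%
  group_by(id) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%  # one max row per id
  ungroup()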
I just have a data frame and want to split it by rows, assign the resulting new data frames to new variables, and save them as csv files.
a <- rep(1:5,each=3)
b <-rep(1:3,each=5)
c <- data.frame(a,b)
   a b
1  1 1
2  1 1
3  1 1
4  2 1
5  2 1
6  2 2
7  3 2
8  3 2
9  3 2
10 4 2
11 4 3
12 4 3
13 5 3
14 5 3
15 5 3
I want to split c by column a, i.e. all rows with a 1 in column a are split from c, assigned to A, and saved as A.csv. The same for B.csv with all the 2s in column a.
What I can do is
A <- c[c$a %in% 1, ]
write.csv(A, "A.csv")
B <- c[c$a %in% 2, ]
write.csv(B, "B.csv")
...
If I have 1000 rows and therefore lots of subsets, is there a simple way to do this with a for loop?
The split() function is very useful for splitting a data frame. You can then use lapply() here; it is cleaner than an explicit loop.
dfs <- split(c, c$a)  # list of data frames, one per value of a

# use the group values (numbers) as file names
lapply(names(dfs),
       function(x) write.csv(dfs[[x]], paste0(x, ".csv"), row.names = FALSE))

# or use letters (max 26!) as file names
names(dfs) <- LETTERS[1:length(dfs)]
lapply(names(dfs),
       function(x) write.csv(dfs[[x]], file = paste0(x, ".csv"), row.names = FALSE))
for (i in seq_along(unique(c$a))) {
  write.csv(c[c$a == i, ], paste0(LETTERS[i], ".csv"))
}
You should consider, however, what happens if you have more than 26 subsets. What will those files be named?
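A sketch that scales past 26 groups is to build the file names from the group values themselves; the "subset_" prefix here is an arbitrary choice:

dfs <- split(c, c$a)
invisible(lapply(names(dfs), function(x) {
  write.csv(dfs[[x]], paste0("subset_", x, ".csv"), row.names = FALSE)
}))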
Suppose I have a matrix in R as follows:
ID Value
 1    10
 2     5
 2     8
 3    15
 4     7
 4     9
...
What I need is a random sample where every ID is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and then sampling each subset.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat), dat$ID, FUN = sample, 1), ]
#   ID Value
# 1  1    10
# 3  2     8
# 4  3    15
# 6  4     9
If your data is truly a matrix and not a data.frame, you can work around this too. Note that $ does not work on a matrix (reference the column as dat[, "ID"]), and a plain matrix usually has no row names, so convert the sampled character indices back to numbers:
dat[as.numeric(tapply(as.character(seq(nrow(dat))), dat[, "ID"], FUN = sample, 1)), ]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
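Note that sample_n() has been superseded in current dplyr; the slice_sample() equivalent (dplyr >= 1.0.0) is:

df %>% group_by(ID) %>% slice_sample(n = 1)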
The idea is to reorder the rows randomly and then remove duplicate IDs in that order:
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]