I have two lists of unequal size, and I want to append specific items from one list to specific positions in the other.
First list
dat <- structure(list(supergrp = c("D", "A", "P", "B"),
clusters = c("1", "2", "3", "1"),
items = structure(list(`1.2` = c("a", "c", "d"),
`2.1` = "b", `3` = "e", `4` = c("e", "b")),
.Names = c("1.2", "2.1", "3", "4"))),
.Names = c("supergrp", "clusters", "items"),
row.names = c(NA, 4L), class = "data.frame")
Second list
val_to_append <- structure(list(supergrp = c("D", "A"),
clusters = c(1, 2),
items = structure(list(`1.2` = c("c", "f"),
`2.1` = c("c", "d", "e")),
.Names = c("1.2", "2.1"))),
.Names = c("supergrp", "clusters", "items"),
row.names = c(NA, -2L), class = "data.frame")
I want to append val_to_append$items[[1]] to dat$items[[3]].
Similarly, I want to append val_to_append$items[[2]] to dat$items[[1]].
The required output is
supergrp clusters items
1 D 1 a, c, d, e
2 A 2 b
3 P 3 e, c, f
4 B 1 e, b
I can do this in a loop:
dat_indx <- c(3,1)
val_indx <- c(1,2)
fin_result <- dat
for (i in seq_along(dat_indx)) {
  # position in dat that receives the extra values
  target <- dat_indx[[i]]
  # append the selected values and drop duplicates
  fin_result$items[[target]] <- unique(c(fin_result$items[[target]],
                                         val_to_append$items[[val_indx[[i]]]]))
}
I tried normal vector indexing such as
append(fin_result$items[[dat_indx]], val_to_append$items[[val_indx]])
without success. Is there an efficient way to do this? My list (that is, my data frame) is very large, with hundreds of thousands of samples.
I am thinking of sapply but don't have a concrete idea.
We can use mapply to achieve this. We append the values from val_to_append$items to dat$items using the index values, which are known beforehand.
dat_indx <- c(3,1)
val_indx <- c(1,2)
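# SIMPLIFY = FALSE keeps the result a list even if all pieces have the same length
dat$items[dat_indx] <- mapply(function(x, y)
  unique(c(dat$items[[x]], val_to_append$items[[y]])), dat_indx, val_indx,
  SIMPLIFY = FALSE)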
dat
# supergrp clusters items
#1 D 1 a, c, d, e
#2 A 2 b
#3 P 3 e, c, f
#4 B 1 e, b
Although this is another way of solving the problem, I am not sure how efficient it is going to be.
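As a minimal sketch of the same idea on plain lists: Map always returns a list, so there is no simplification to worry about. The helper name append_at and the toy vectors below are made up for illustration, and I have not benchmarked this on hundreds of thousands of rows.

```r
# Hypothetical helper: append values from `source` to selected elements of `target`,
# keeping only unique entries in each element.
append_at <- function(target, source, target_idx, source_idx) {
  target[target_idx] <- Map(function(x, y) unique(c(x, y)),
                            target[target_idx], source[source_idx])
  target
}

# Toy data mirroring dat$items and val_to_append$items
items <- list(c("a", "c", "d"), "b", "e", c("e", "b"))
extra <- list(c("c", "f"), c("c", "d", "e"))

res <- append_at(items, extra, target_idx = c(3, 1), source_idx = c(1, 2))
res[[1]]  # "a" "c" "d" "e"
res[[3]]  # "e" "c" "f"
```

Note that this assumes each target position appears only once in target_idx; with repeated positions, only the last assignment would survive.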
Update 1
Linking the actual dataset since the solutions given for the example data are not working out for me.
Link: https://app.box.com/s/65j1enr13pi51i44mfrymccklw1artot
Please note that LOT is the end of the row marker.
--
I have a data frame like the following (single column):
D
2
f
h
k
END_ROW_WORD
k
1
2
END_ROW_WORD
e
g
j
2
k
END_ROW_WORD
I'd like to convert it into a wide format in which each run of values becomes one row.
As you can see, there is a specific word (END_ROW_WORD) that marks the end of each row.
Here is a similar approach to Alejandro's, but using split instead of a for loop:
colstarts <- diff(c(0, which(df == "END_ROW_WORD")))
rows <- split(df[[1]], rep(1:length(colstarts), colstarts))
rows <- lapply(rows, `length<-`, max(lengths(rows)))
as.data.frame(do.call(rbind, rows))
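For what it's worth, here is a self-contained sketch of the same split-and-pad idea that builds the row ids with cumsum instead of diff/rep. The vector x just re-creates the example column, and the names row_id, padded and wide are my own.

```r
# Example column as a plain character vector
x <- c("D", "2", "f", "h", "k", "END_ROW_WORD",
       "k", "1", "2", "END_ROW_WORD",
       "e", "g", "j", "2", "k", "END_ROW_WORD")

# A new row starts right after each end marker
is_end <- x == "END_ROW_WORD"
row_id <- cumsum(c(FALSE, head(is_end, -1))) + 1

# Split into rows, then pad the short ones with NA up to the longest row
rows   <- split(x, row_id)
width  <- max(lengths(rows))
padded <- lapply(rows, function(r) { length(r) <- width; r })

wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
wide
```

Short rows end up with NA in the trailing columns, which is usually easier to handle downstream than empty strings.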
A solution without for-loops, but with stringr:
library(stringr)
new_text <- str_c(df$V1, collapse = " ")
new_text <- str_replace_all(new_text, "END_ROW_WORD", "END_ROW_WORD\n")
read.table(text = new_text, fill = TRUE)
# V1 V2 V3 V4 V5 V6
# 1 D 2 f h k END_ROW_WORD
# 2 k 1 2 END_ROW_WORD
# 3 e g j 2 k END_ROW_WORD
Data
df <-
structure(list(V1 = structure(c(3L, 2L, 6L, 8L, 10L, 5L, 10L, 1L, 2L, 5L, 4L, 7L, 9L, 2L, 10L, 5L),
.Label = c("1", "2", "D", "e", "END_ROW_WORD", "f", "g", "h", "j", "k"),
class = "factor")),
.Names = "V1", class = "data.frame", row.names = c(NA, -16L))
This first puts a newline character, "\n", after every "END_ROW_WORD" marker, then pastes the result into a long character string.
Then, it uses read.table to read the data in from a text connection.
end <- "END_ROW_WORD"
inx <- c(0, grep(end, dat[[1]]))
s <- NULL
for(i in seq_along(inx)[-1]){
s <- c(s, dat[[1]][(inx[(i - 1)] + 1):inx[i]], "\n")
}
con <- textConnection(paste(s, collapse = " "))
result <- read.table(con, fill = TRUE)
close(con)
result
# V1 V2 V3 V4 V5 V6
#1 D 2 f h k END_ROW_WORD
#2 k 1 2 END_ROW_WORD
#3 e g j 2 k END_ROW_WORD
DATA.
dat <-
structure(list(V1 = c("D", "2", "f", "h", "k", "END_ROW_WORD",
"k", "1", "2", "END_ROW_WORD", "e", "g", "j", "2", "k", "END_ROW_WORD"
)), .Names = "V1", class = "data.frame", row.names = c(NA, -16L
))
EDIT.
After the OP edited the question, I revised the code to see whether that file can be read properly into a data.frame. The main difficulty is that the file has many non-printable characters, and read.table was having trouble getting to the end of the file.
Credit for the solution to this problem goes to the accepted answer to the question "read.csv warning 'EOF within quoted string' prevents complete reading of file". I upvoted both the question and that answer.
Credit must also be given to @kath: the idea in that answer of using a string replacement to insert newline characters as EOL markers is much better than my ugly for loop above. Unlike kath, I use base R only; I don't find it necessary to load an external package.
Now the revised code.
# Use this first pattern if AUCTION also marks the end of a row
#pattern <- "(^LOT|^AUCTION)"
pattern <- "(^LOT)"
dat <- readLines("data_.csv")
s <- gsub("[[:cntrl:]]", "", dat)
s <- sub(pattern, "\\1\n", s)
con <- textConnection(paste(s, collapse = "\t"))
result <- read.table(con, sep = "\t", fill = TRUE, quote = "", row.names = NULL)
close(con)
head(result)
tail(result)
str(result)
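To see what the revised code does without the real file, here is a small sketch on an in-memory vector. The vector lines_in and its values are invented stand-ins for the output of readLines, and the extra gsub that drops the tab left at the start of each new row is my own addition, not part of the code above.

```r
# Invented stand-in for readLines("data_.csv"): one value per line, LOT ends a row
lines_in <- c("x1", "x2", "LOT", "y1", "LOT", "z1", "z2", "LOT")

pattern <- "(^LOT)"
s <- gsub("[[:cntrl:]]", "", lines_in)   # strip control characters
s <- sub(pattern, "\\1\n", s)            # newline after every end-of-row marker

txt <- paste(s, collapse = "\t")
# Assumption: remove the separator that would otherwise start each new row
txt <- gsub("\n\t", "\n", txt, fixed = TRUE)

result <- read.table(text = txt, sep = "\t", fill = TRUE, quote = "",
                     stringsAsFactors = FALSE)
result
```

Without that last gsub every row after the first begins with an empty field, which may or may not matter for your data.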
I thought that there would be some empty rows, so I checked it with the following code.
#
# See if there are any empty rows
#
empty <- apply(result, 1, function(x) nchar(trimws(paste0(x, collapse = ""))) == 0)
sum(empty)
#[1] 0
This might not be the best way to do it, but it works (data is assumed to be the column as a character vector):
pos_help = which(grepl("END_ROW_WORD",data))
d = list()
for(i in 1:length(pos_help)){
if(i == 1){
d[[i]] = data[1:pos_help[1]]
} else {
d[[i]] = data[(pos_help[i-1]+1):pos_help[i]]
}
}
dataFrame = do.call(rbind,lapply(d, "length<-", max(lengths(d))))
Without a loop, but using map and split (because why not :p):
library(tidyverse)
df <- tibble(x=c(
"D",
"2",
"f",
"h",
"k",
"END_ROW_WORD",
"k",
"1",
"2",
"END_ROW_WORD",
"e",
"g",
"j",
"2",
"k",
"END_ROW_WORD"
)
)
split(df,cut(1:16,breaks=c(0,which(df == "END_ROW_WORD")))) %>%
map_dfc(~rbind(.x,tibble(x=rep(NA,(6-nrow(.x)))))) %>%
t() %>% as.data.frame()
I have a Qualtrics multiple choice question that I want to use to create graphs in R. My data is organized so that a participant can select multiple answers for each question. For example, participant 1 selected multiple choice answers 1 (Q1_1) and 3 (Q1_3). I want to collapse all answer choices into one bar graph, with one bar for each response option (Q1_1:Q1_3), where the bar height is the number of selections divided by the number of respondents who answered this question (in this case, 3).
df <- structure(list(Participant = 1:3, A = c("a", "a", ""), B = c("", "b", "b"), C = c("c", "c", "c")), .Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"), row.names = c(NA, -3L), class = "data.frame")
I want to use ggplot2 and maybe some sort of loop through Q1_1: Q1_3?
Perhaps this is what you want
f <-
structure(
list(
Participant = 1:3,
A = c("a", "a", ""),
B = c("", "b", "b"),
C = c("c", "c", "c")),
.Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"),
row.names = c(NA, -3L),
class = "data.frame"
)
library(tidyr)
library(dplyr)
library(ggplot2)
nparticipant <- nrow(f)
f %>%
## Reformat the data
gather(question, response, starts_with("Q")) %>%
filter(response != "") %>%
## calculate the height of the bars
group_by(question) %>%
summarise(score = length(response)/nparticipant) %>%
## Plot
ggplot(aes(x=question, y=score)) +
geom_bar(stat = "identity")
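If it helps to check the bar heights, here is a base R sketch that computes the same proportions without reshaping. It assumes blank strings mean "not selected" and that the question columns all start with "Q1_"; the names q_cols, selected and prop are mine.

```r
df <- data.frame(Participant = 1:3,
                 Q1_1 = c("a", "a", ""),
                 Q1_2 = c("", "b", "b"),
                 Q1_3 = c("c", "c", "c"),
                 stringsAsFactors = FALSE)

# Columns belonging to question 1
q_cols <- grep("^Q1_", names(df), value = TRUE)

# TRUE where a participant ticked the option
selected <- df[q_cols] != ""

# Respondents who ticked at least one option of this question
n_answered <- sum(rowSums(selected) > 0)

# Share of respondents per option
prop <- colSums(selected) / n_answered
prop
# barplot(prop) would draw the same bars with base graphics
```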
Here is a solution using ddply from the plyr package.
library(plyr)
library(ggplot2)
# I needed to increase number of participants to ensure it works in every case
df = data.frame(Participant = 1:100,
Q1_1 = sample(c("a", ""), 100, replace = T, prob = c(1/2, 1/2)),
Q1_2 = sample(c("b", ""), 100, replace = T, prob = c(2/3, 1/3)),
Q1_3 = sample(c("c", ""), 100, replace = T, prob = c(1/3, 2/3)))
df$answer = paste0(df$Q1_1, df$Q1_2, df$Q1_3)
summ = ddply(df, c("answer"), summarize, freq = length(answer)/nrow(df))
## Re-ordering of factor levels summ$answer
summ$answer <- factor(summ$answer, levels=c("", "a", "b", "c", "ab", "ac", "bc", "abc"))
# Plot
ggplot(summ, aes(answer, freq, fill = answer)) + geom_bar(stat = "identity") + theme_bw()
Note: it might be more complicated if you have more columns relating to other questions ("Q2_1", "Q2_2"...). In that case, melting the data for each question could be a solution.
I think you want something like this (proportion with a stacked bar chart):
Participant Q1_1 Q1_2 Q1_3
1 1 a c
2 2 a a c
3 3 c b c
4 4 b d
library(reshape2)  # melt() is assumed to come from reshape2
library(ggplot2)
# ensure that all question columns have the same factor levels, ignore blanks
for (i in 2:4) {
df[,i] <- factor(df[,i], levels = c(letters[1:4]))
}
tdf <- as.data.frame(sapply(df[2:4], function(x)table(x)/sum(table(x))))
tdf$choice <- rownames(tdf)
tdf <- melt(tdf, id='choice')
ggplot(tdf, aes(variable, value, fill=choice)) +
geom_bar(stat='identity') +
xlab('Questions') +
ylab('Proportion of Choice')
I have the following data frame in R:
ID COL.1 COL.2 COL.3 COL.4
1 a b
2 v b b
3 x a n h
4 t
I am new to R and I don't understand how to process the data frame in order to end up with this:
stream <- c("1,a,b","2,v,b,b","3,x,a,n,h","4,t")
Another problem is that I have more than 100 columns.
Try this
Reduce(function(...)paste(...,sep=","), df)
Where df is your data.frame
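Note that pasting the columns as they are keeps the separators for blank cells. A minimal sketch that drops blank or missing cells row by row, and does not care how many columns there are, could look like this. The data frame re-creates the example, assuming the empty cells are empty strings.

```r
df <- data.frame(ID    = 1:4,
                 COL.1 = c("a", "v", "x", "t"),
                 COL.2 = c("b", "b", "a", ""),
                 COL.3 = c("", "b", "n", ""),
                 COL.4 = c("", "", "h", ""),
                 stringsAsFactors = FALSE)

# For each row: trim padding, drop blank/NA cells, join the rest with commas
stream <- apply(df, 1, function(row) {
  vals <- trimws(row)
  paste(vals[!is.na(vals) & vals != ""], collapse = ",")
})
stream <- unname(stream)
stream
```

apply converts the data frame to a character matrix first, which is why the trimws call is there: numeric columns can get padded with spaces.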
This might be what you're looking for, even though it's not elegant.
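# Use NA (not NULL) for missing cells: NULL is dropped by c(), so the columns
# would end up with different lengths and data.frame() would fail
my_df <- data.frame(ID = seq(1, 4, by = 1),
                    COL.1 = c("a", "v", "x", "t"),
                    COL.2 = c("b", "b", "a", NA),
                    COL.3 = c(NA, "b", "n", NA),
                    COL.4 = c(NA, NA, "h", NA))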
# Keep the ID at the front, as in the desired output
stream <- paste(my_df$ID,
                my_df$COL.1,
                my_df$COL.2,
                my_df$COL.3,
                my_df$COL.4,
                sep = ",")
stream <- gsub(",NA", "", stream)
stream <- gsub("NA,", "", stream)
I need to merge a large list (approx. 15 data frames, each 16000x6).
Each data frame has 2 id columns, "A" and "B", plus 4 columns with information.
I want to have the first two ("A" and "B") plus 15*4 columns in one data frame.
I have found this in another question:
Reduce(function(x,y) merge(x,y,by="your tag here"),your_list_here)
However, this crashes my machine with the following error because it needs too much RAM (and that is using a list with only 3 dfs!):
In make.unique(as.character(rows)) :
Reached total allocation of 4060Mb: see help(memory.size)
I believe there must be a better strategy. I started with bind_cols from the dplyr package, which very quickly gives me a data frame with duplicated A and B columns. Maybe removing these columns, keeping only the first two, is a better approach.
Here is a small toy list (the Reduce(...) strategy works here, but I need another solution):
dput(mylist)
structure(list(df1 = structure(list(A = c(1, 1, 2, 2, 3, 3),
B = c("Q", "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646,
0.0418491987511516, 0.798411589581519, 0.898478724062443,
0.064307059859857, 0.174364002654329), x2 = c(0.676136856665835,
0.494200984947383, 0.534940708894283, 0.220597118837759,
0.480761741055176, 0.0230771545320749)), .Names = c("A",
"B", "x1", "x2"), row.names = c(NA, -6L), class = "data.frame"),
df2 = structure(list(A = c(1, 1, 2, 2, 3, 3), B = c("Q",
"Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646, 0.0418491987511516,
0.798411589581519, 0.898478724062443, 0.064307059859857,
0.174364002654329), x2 = c(0.676136856665835, 0.494200984947383,
0.534940708894283, 0.220597118837759, 0.480761741055176,
0.0230771545320749)), .Names = c("A", "B", "x1", "x2"), row.names = c(NA,
-6L), class = "data.frame"), df3 = structure(list(A = c(1,
1, 2, 2, 3, 3), B = c("Q", "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646,
0.0418491987511516, 0.798411589581519, 0.898478724062443,
0.064307059859857, 0.174364002654329), x2 = c(0.676136856665835,
0.494200984947383, 0.534940708894283, 0.220597118837759,
0.480761741055176, 0.0230771545320749)), .Names = c("A",
"B", "x1", "x2"), row.names = c(NA, -6L), class = "data.frame")), .Names = c("df1",
"df2", "df3"))
For cbind-ing the dataframes you can do:
L <- mylist[[1]]
for (i in 2:length(mylist)) L <- cbind(L, mylist[[i]][-(1:2)])
For merging (as in the previously shown, but wrong, expected output for the example):
L <- mylist[[1]]
for (i in 2:length(mylist)) L <- merge(L, mylist[[i]], by=c("A", "B"))
In the case of merging, I suppose the memory demand comes from the m:n relationships among the data frames: the key pairs (A, B) are not unique, so every merge multiplies the matching rows. No alternative merge procedure can solve this.
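To illustrate the m:n effect, here is a small sketch using the key columns of the toy list (x1 is a dummy value column I made up). Two of the key pairs occur twice, so each additional merge doubles the rows for those pairs.

```r
d <- data.frame(A  = c(1, 1, 2, 2, 3, 3),
                B  = c("Q", "Q", "Q", "P", "P", "P"),
                x1 = 1:6)

# (1, Q) and (3, P) occur twice each; (2, Q) and (2, P) occur once each
m2 <- merge(d, d, by = c("A", "B"))
nrow(m2)   # 10 = 2*2 + 1 + 1 + 2*2

m3 <- merge(m2, d, by = c("A", "B"))
nrow(m3)   # 18 = 2*2*2 + 1 + 1 + 2*2*2

# Same pattern for k data frames with these keys: 2 * 2^k + 2 rows
rows_for_k <- function(k) 2 * 2^k + 2
rows_for_k(15)
```

With 16000 rows and many repeated key pairs, this growth is what I would expect to exhaust the memory.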
Based on the comment stating you want a 16,000 x 62 data.frame...
First cbind the non ID columns:
tmp <- do.call(cbind, lapply(mylist, function(x) x[,-(1:2)]))
Then add "A" and "B"
final <- cbind(mylist[[1]][,1:2], tmp)
No merging needed, just slap the data.frames together
> final
A B df1.x1 df1.x2 df2.x1 df2.x2 df3.x1 df3.x2
1 1 Q 0.45840140 0.67613686 0.45840140 0.67613686 0.45840140 0.67613686
2 1 Q 0.04184920 0.49420098 0.04184920 0.49420098 0.04184920 0.49420098
3 2 Q 0.79841159 0.53494071 0.79841159 0.53494071 0.79841159 0.53494071
4 2 P 0.89847872 0.22059712 0.89847872 0.22059712 0.89847872 0.22059712
5 3 P 0.06430706 0.48076174 0.06430706 0.48076174 0.06430706 0.48076174
6 3 P 0.17436400 0.02307715 0.17436400 0.02307715 0.17436400 0.02307715
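One caveat: cbind pairs rows purely by position, so it is only safe when every data frame has the same A and B values in the same order. A hedged sketch of a check before binding follows; the two small data frames and the names ids and same_ids are invented for illustration.

```r
mylist <- list(
  df1 = data.frame(A = c(1, 2), B = c("Q", "P"), x1 = c(0.1, 0.2), x2 = c(0.5, 0.6),
                   stringsAsFactors = FALSE),
  df2 = data.frame(A = c(1, 2), B = c("Q", "P"), x1 = c(0.3, 0.4), x2 = c(0.7, 0.8),
                   stringsAsFactors = FALSE)
)

# Reference id columns from the first data frame
ids <- mylist[[1]][c("A", "B")]

# Does every data frame carry the same ids in the same row order?
same_ids <- vapply(mylist,
                   function(d) isTRUE(all.equal(d[c("A", "B")], ids)),
                   logical(1))
stopifnot(all(same_ids))

# Safe to bind by position now
tmp   <- do.call(cbind, lapply(mylist, function(d) d[, -(1:2)]))
final <- cbind(ids, tmp)
names(final)
```

If the check fails, the data frames would need to be sorted by A and B first (or merged on unique keys) before binding.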
I have hundreds of rows I want to edit, so I'd rather not do it "manually" via scripts like these:
a <- data.frame(name = c("A", "B", "C", "D"), b = 1:4)
rownames(a) <- a$name
All rows have the same signifier I want to remove, ".meio", such that the rownames are currently:
A.meio, B.meio, C.meio, D.meio ...
I would like the row names to be
A, B, C, D, etc.
How can I do this efficiently?
Thank you.
You can use the gsub function.
It works like this:
> a <- structure(list(name = structure(1:4, .Label = c("A", "B",
+ "C",
+ "D"), class = "factor"), b = 1:4), .Names = c("name", "b"),
+ row.names = c("A.meio",
+ "B.meio", "C.meio", "D.meio"), class = "data.frame")
> a
name b
A.meio A 1
B.meio B 2
C.meio C 3
D.meio D 4
> row.names(a) <- gsub(".meio", "", row.names(a), fixed = TRUE)
> a
name b
A A 1
B B 2
C C 3
D D 4
The difference between sub and gsub is that sub only replaces the first occurrence of the specified pattern, whereas gsub does it for all occurrences (that is, it replaces globally).
We could use sub to match a literal dot (\\.) followed by zero or more characters (.*) up to the end of the string ($) and replace it with ''.
row.names(a) <- sub("\\..*$", '', row.names(a))
NOTE: From the example shown by the OP, it seems that there is only a single instance of .meio, so sub is sufficient.
data
a <- structure(list(name = structure(1:4, .Label = c("A", "B",
"C",
"D"), class = "factor"), b = 1:4), .Names = c("name", "b"),
row.names = c("A.meio",
"B.meio", "C.meio", "D.meio"), class = "data.frame")
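A small sketch of why the dot matters. The third row name is invented to show the corner case: an unescaped "." matches any character, so a plain ".meio" pattern can remove more than the suffix.

```r
rn <- c("A.meio", "B.meio", "Cameio.meio")  # third name is a made-up corner case

# Escaped dot, anchored at the end: only the suffix is removed
clean <- sub("\\.meio$", "", rn)
clean                      # "A" "B" "Cameio"

# Unescaped dot with gsub: "ameio" also matches and is removed
greedy <- gsub(".meio", "", rn)
greedy                     # "A" "B" "C"

# sub vs gsub on a simpler string
sub("a", "", "banana")     # "bnana"
gsub("a", "", "banana")    # "bnn"
```

Using fixed = TRUE, or escaping the dot and anchoring with $, avoids the surprise.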