How do I edit a large number of row names in R? - r

I have hundreds of rows I want to edit, so I'd rather not do it "manually" via scripts like these:
a <- data.frame(name = c("A", "B", "C", "D"), b = 1:4)
rownames(a) <- a$name
All rows have the same signifier I want to remove, ".meio", such that the rownames are currently:
A.meio, B.meio, C.meio, D.meio ...
I would like the row names to be
A, B, C, D, etc.
How can I do this efficiently?
Thank you.

You can use the gsub function.
It works like this:
> a <- structure(list(name = structure(1:4, .Label = c("A", "B",
+ "C",
+ "D"), class = "factor"), b = 1:4), .Names = c("name", "b"),
+ row.names = c("A.meio",
+ "B.meio", "C.meio", "D.meio"), class = "data.frame")
> a
name b
A.meio A 1
B.meio B 2
C.meio C 3
D.meio D 4
> row.names(a)=gsub(".meio","",row.names(a))
> a
name b
A A 1
B B 2
C C 3
D D 4
Note that sub would also work here. The difference is that sub only replaces the first occurrence of the specified pattern, whereas gsub replaces all occurrences (that is, it replaces globally).
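One caveat worth illustrating: the pattern passed to gsub is a regular expression, so an unescaped . matches any character, not just a literal dot. The sketch below shows the difference on a small vector; the row name "Xameio.meio" is a made-up example (not from the question), chosen only to trigger the problem.

```r
# Hypothetical row names; the third one is invented to show the pitfall
rn <- c("A.meio", "B.meio", "Xameio.meio")

# Unescaped dot: ".meio" also matches "ameio" inside the third name
loose <- gsub(".meio", "", rn)

# Escaped dot, anchored to the end of the string
anchored <- sub("\\.meio$", "", rn)

# Literal (non-regex) matching
literal <- gsub(".meio", "", rn, fixed = TRUE)

print(loose)     # "A" "B" "X"
print(anchored)  # "A" "B" "Xameio"
print(literal)   # "A" "B" "Xameio"
```

For the row names in the question all three variants give the same result, so this only matters if the text before the suffix can itself contain "meio".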

We could use sub to match a literal dot (\\.) followed by zero or more characters (.*) up to the end of the string ($) and replace it with ''.
row.names(a) <- sub("\\..*$", '', row.names(a))
NOTE: From the example shown by the OP, it seems that there is only a single instance of .meio, so sub is sufficient.
data
a <- structure(list(name = structure(1:4, .Label = c("A", "B",
"C",
"D"), class = "factor"), b = 1:4), .Names = c("name", "b"),
row.names = c("A.meio",
"B.meio", "C.meio", "D.meio"), class = "data.frame")

Related

Insert specific elements in specific locations of a list

I have lists of unequal size, and I want to append specific items from one list to specific positions in the other.
First list
dat <- structure(list(supergrp = c("D", "A", "P", "B"),
clusters = c("1", "2", "3", "1"),
items = structure(list(`1.2` = c("a", "c", "d"),
`2.1` = "b", `3` = "e", `4` = c("e", "b")),
.Names = c("1.2", "2.1", "3", "4"))),
.Names = c("supergrp", "clusters", "items"),
row.names = c(NA, 4L), class = "data.frame")
Second list
val_to_append <- structure(list(supergrp = c("D", "A"),
clusters = c(1, 2),
items = structure(list(`1.2` = c("c", "f"),
`2.1` = c("c", "d", "e")),
.Names = c("1.2", "2.1"))),
.Names = c("supergrp", "clusters", "items"),
row.names = c(NA, -2L), class = "data.frame")
I want to append val_to_append$items[[1]] to dat$items[[3]].
Similarly, I want to append val_to_append$items[[2]] to dat$items[[1]].
The required output is
supergrp clusters items
1 D 1 a, c, d, e
2 A 2 b
3 P 3 e, c, f
4 B 1 e, b
I can do this in a loop:
dat_indx <- c(3,1)
val_indx <- c(1,2)
fin_result <- dat
for(i in seq_along(dat_indx)) {
out_put_indx <- dat_indx[[i]]
fin_result$items[[dat_indx[[i]]]] <- unique(c(fin_result$items[[dat_indx[[i]]]],
val_to_append$items[[val_indx[[i]]]]))
}
I tried normal vector indexing such as
append(fin_result$items[[dat_indx]], val_to_append$items[[val_indx]])
without success. Is there an efficient way to do this? My list (that is, my data frame) is very large, with hundreds of thousands of samples.
I am thinking of sapply but don't have a concrete idea.
We can use mapply to achieve this. We append the values from val_to_append$items to dat$items using the index values, which are known beforehand.
dat_indx <- c(3,1)
val_indx <- c(1,2)
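# SIMPLIFY = FALSE keeps the result a list even when all pieces have equal length
dat$items[dat_indx] <- mapply(function(x, y)
unique(c(dat$items[[x]], val_to_append$items[[y]])), dat_indx, val_indx,
SIMPLIFY = FALSE)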
dat
# supergrp clusters items
#1 D 1 a, c, d, e
#2 A 2 b
#3 P 3 e, c, f
#4 B 1 e, b
Although this is another way of solving the problem, I am not sure it will be any more efficient than the loop.
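As a variation on the same idea, here is a minimal sketch that uses Map, which always returns a list, and passes the list elements themselves rather than their indices. It rebuilds small versions of dat and val_to_append so that it runs on its own, and it assumes each target index appears only once in dat_indx. Whether it is faster on hundreds of thousands of rows would need to be measured.

```r
# Small stand-ins for the question's data (list columns built by hand)
dat <- data.frame(supergrp = c("D", "A", "P", "B"),
                  clusters = c("1", "2", "3", "1"),
                  stringsAsFactors = FALSE)
dat$items <- list(c("a", "c", "d"), "b", "e", c("e", "b"))

val_to_append <- data.frame(supergrp = c("D", "A"), clusters = c(1, 2),
                            stringsAsFactors = FALSE)
val_to_append$items <- list(c("c", "f"), c("c", "d", "e"))

dat_indx <- c(3, 1)  # positions in dat$items to extend
val_indx <- c(1, 2)  # matching positions in val_to_append$items

# Map pairs the selected elements and returns a list of merged vectors
dat$items[dat_indx] <- Map(function(old, new) unique(c(old, new)),
                           dat$items[dat_indx],
                           val_to_append$items[val_indx])

print(dat$items[[1]])  # "a" "c" "d" "e"
print(dat$items[[3]])  # "e" "c" "f"
```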

R multiple choice questionnaire data to ggplot

I have a Qualtrics multiple choice question that I want to use to create graphs in R. My data is organized so that a participant can select multiple answers for each question. For example, participant 1 selected multiple choice answers 1 (Q1_1) & 3 (Q1_3). I want to collapse all answer choices into one bar graph, with one bar for each response option (Q1_1:Q1_3), where each count is divided by the number of respondents who answered this question (in this case, 3).
df <- structure(list(Participant = 1:3, A = c("a", "a", ""), B = c("", "b", "b"), C = c("c", "c", "c")), .Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"), row.names = c(NA, -3L), class = "data.frame")
I want to use ggplot2 and maybe some sort of loop through Q1_1: Q1_3?
Perhaps this is what you want
f <-
structure(
list(
Participant = 1:3,
A = c("a", "a", ""),
B = c("", "b", "b"),
C = c("c", "c", "c")),
.Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"),
row.names = c(NA, -3L),
class = "data.frame"
)
library(tidyr)
library(dplyr)
library(ggplot2)
nparticipant <- nrow(f)
f %>%
## Reformat the data
gather(question, response, starts_with("Q")) %>%
filter(response != "") %>%
## calculate the height of the bars
group_by(question) %>%
summarise(score = length(response)/nparticipant) %>%
## Plot
ggplot(aes(x=question, y=score)) +
geom_bar(stat = "identity")
Here is a solution using ddply from the plyr package.
library(plyr)
library(ggplot2)
# I needed to increase the number of participants to ensure it works in every case
df = data.frame(Participant = 1:100,
Q1_1 = sample(c("a", ""), 100, replace = T, prob = c(1/2, 1/2)),
Q1_2 = sample(c("b", ""), 100, replace = T, prob = c(2/3, 1/3)),
Q1_3 = sample(c("c", ""), 100, replace = T, prob = c(1/3, 2/3)))
df$answer = paste0(df$Q1_1, df$Q1_2, df$Q1_3)
summ = ddply(df, c("answer"), summarize, freq = length(answer)/nrow(df))
## Re-ordering of factor levels summ$answer
summ$answer <- factor(summ$answer, levels=c("", "a", "b", "c", "ab", "ac", "bc", "abc"))
# Plot
ggplot(summ, aes(answer, freq, fill = answer)) + geom_bar(stat = "identity") + theme_bw()
Note: it might be more complicated if you have more columns relating to other questions ("Q2_1", "Q2_2"...). In this case, melting the data for each question could be a solution.
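If only the share of respondents per answer option is needed (as the question describes), a minimal base R sketch is shown below. It assumes that an empty string means "not selected" and that a respondent counts as having answered if at least one option is selected; the names q_cols, selected, answered, prop and plot_df are illustrative, not from the answers above.

```r
# The question's example data
df <- data.frame(Participant = 1:3,
                 Q1_1 = c("a", "a", ""),
                 Q1_2 = c("", "b", "b"),
                 Q1_3 = c("c", "c", "c"),
                 stringsAsFactors = FALSE)

# Pick the columns of one question by prefix
q_cols <- grep("^Q1_", names(df), value = TRUE)

# TRUE where an option was selected
selected <- df[q_cols] != ""

# Keep respondents who selected at least one option
answered <- rowSums(selected) > 0

# Proportion of those respondents who chose each option
prop <- colMeans(selected[answered, , drop = FALSE])

# Long-format table, ready for ggplot(plot_df, aes(question, score)) + geom_col()
plot_df <- data.frame(question = names(prop), score = unname(prop))
print(plot_df)
```

Changing the prefix in grep (for example to "^Q2_") would repeat the calculation for another question without reshaping the whole table.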
I think you want something like this (proportion with a stacked bar chart):
Participant Q1_1 Q1_2 Q1_3
1 1 a c
2 2 a a c
3 3 c b c
4 4 b d
library(reshape2)  # provides melt()
library(ggplot2)
# ensure that all question columns have the same factor levels, ignore blanks
for (i in 2:4) {
df[,i] <- factor(df[,i], levels = c(letters[1:4]))
}
tdf <- as.data.frame(sapply(df[2:4], function(x)table(x)/sum(table(x))))
tdf$choice <- rownames(tdf)
tdf <- melt(tdf, id='choice')
ggplot(tdf, aes(variable, value, fill=choice)) +
geom_bar(stat='identity') +
xlab('Questions') +
ylab('Proportion of Choice')

R Creating Dynamic variables from group aggregated set of DataFrames

My problem is that I have several data frames named df1, df2, df3. The data looks like this:
df1
a,b,c,d
1,2,3,4
1,2,3,4
df2
a,b,c,d
1,2,3,4
1,2,3,4
Now, for these two data frames I should create a new data frame from an aggregated column of each, and for that I am using the code below:
for(i in 1:2){
assign(paste(final_val,i,sep=''),sum(assign(paste(df,i,sep='')))$d*100)}
I am getting the error:
Error in assign(paste(hvp_route_dsct_clust, i, sep = "")) :
argument "value" is missing, with no default
My output should look like
final_val1 <- 800
final_val2 <- 800
And from those values, final_val1 and final_val2, I should create a data frame dynamically.
Can anybody please help me with this?
If we need to use assign, get the object names from the global environment with ls by specifying the pattern 'df' followed by one or more digits (\\d+) as 'nm1', create another vector of 'final_val' names ('nm2'), then loop through the sequence of 'nm1' and assign to each element of 'nm2' the sum of column 'd' of the corresponding 'df' multiplied by 100.
nm1 <- ls(pattern = "df\\d+")
nm2 <- paste0("final_val", seq_along(nm1))
for(i in seq_along(nm1)){
assign(nm2[i], sum(get(nm1[i])$d*100))
}
final_val1
#[1] 800
final_val2
#[1] 800
Otherwise, we can place the datasets in a list, extract the 'd' column, multiply by 100, and take the column sums:
unname(colSums(sapply(mget(nm1), `[[`, 'd') * 100))
#800 800
data
df1 <- structure(list(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L,
4L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L,
4L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c(NA,
-2L))
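Since the question also asks for a data frame built from these values, here is a minimal sketch that keeps everything in a named list and avoids assign altogether. The names dfs, final_vals and final_df are illustrative assumptions, not part of the question.

```r
# Same toy data as above
df1 <- data.frame(a = c(1L, 1L), b = c(2L, 2L), c = c(3L, 3L), d = c(4L, 4L))
df2 <- df1

# Keep the data frames in a named list instead of separate variables
dfs <- list(df1 = df1, df2 = df2)

# One aggregated value per data frame: sum of column d, times 100
final_vals <- vapply(dfs, function(x) sum(x$d) * 100, numeric(1))

# Collect the results in a data frame
final_df <- data.frame(source = names(final_vals),
                       final_val = unname(final_vals),
                       stringsAsFactors = FALSE)
print(final_df)
#   source final_val
# 1    df1       800
# 2    df2       800
```

A named vector or data frame of results is usually easier to pass on to later steps than a set of dynamically created variables.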

How to replicate rows in dataframe for every comma separated item in a column [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I have this dataframe called mydf. What I need to do is replicate the rows for every item separated by comma in the cd column and get the result as shown in result.
mydf<-structure(list(cc = structure(1:3, .Label = c("a", "b", "c"), class = "factor"),
cd = structure(1:3, .Label = c("e,f,g", "f,g,s", "g,h,g"), class = "factor"),
individuals = structure(1:3, .Label = c("apple", "ball",
"cat"), class = "factor")), .Names = c("cc", "cd", "individuals"
), row.names = c(NA, -3L), class = "data.frame")
result
cc cd individuals
a e apple
a f apple
a g apple
b f ball
b g ball
b s ball
c g cat
c h cat
c g cat
A dplyr/tidyr way (with stringi for the split):
library(stringi)
library(dplyr)
library(tidyr)
mydf %>%
mutate(cd = cd %>% stri_split_fixed(",") ) %>%
unnest(cd)
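For comparison, the same reshaping can be sketched in base R without extra packages: split the cd column, then repeat the other columns once per piece. This is only an illustration; it rebuilds mydf with character columns so that it runs on its own, and the names parts and result are assumptions.

```r
# The question's data, with character columns for simplicity
mydf <- data.frame(cc = c("a", "b", "c"),
                   cd = c("e,f,g", "f,g,s", "g,h,g"),
                   individuals = c("apple", "ball", "cat"),
                   stringsAsFactors = FALSE)

# Split each cd value on commas (as.character also covers factor input)
parts <- strsplit(as.character(mydf$cd), ",", fixed = TRUE)

# Repeat the other columns once for every piece of the matching row
result <- data.frame(cc = rep(mydf$cc, lengths(parts)),
                     cd = unlist(parts),
                     individuals = rep(mydf$individuals, lengths(parts)),
                     stringsAsFactors = FALSE)
print(result)
```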

Merge large list of data frames into one data frame by columns

I need to merge a large list (approx. 15 data frames [16000x6]).
Each data frame has 2 id columns, "A" and "B", plus 4 columns with information.
I want the first two columns ("A" and "B") plus the 15*4 information columns in one data frame.
I have found this in another question:
Reduce(function(x,y) merge(x,y,by="your tag here"),your_list_here)
However, this crashes my machine with the following error because it needs too much RAM (even with a list of only 3 data frames!):
In make.unique(as.character(rows)) :
Reached total allocation of 4060Mb: see help(memory.size)
I believe there must be a better strategy. I started with bind_cols from the dplyr package, which very quickly gives me a data frame with duplicated A and B columns. Maybe removing these columns, keeping only the first two, is a better approach.
I provide you a small toy list (the Reduce(...) strategy works here but I need another solution)
dput(mylist)
structure(list(df1 = structure(list(A = c(1, 1, 2, 2, 3, 3),
B = c("Q", "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646,
0.0418491987511516, 0.798411589581519, 0.898478724062443,
0.064307059859857, 0.174364002654329), x2 = c(0.676136856665835,
0.494200984947383, 0.534940708894283, 0.220597118837759,
0.480761741055176, 0.0230771545320749)), .Names = c("A",
"B", "x1", "x2"), row.names = c(NA, -6L), class = "data.frame"),
df2 = structure(list(A = c(1, 1, 2, 2, 3, 3), B = c("Q",
"Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646, 0.0418491987511516,
0.798411589581519, 0.898478724062443, 0.064307059859857,
0.174364002654329), x2 = c(0.676136856665835, 0.494200984947383,
0.534940708894283, 0.220597118837759, 0.480761741055176,
0.0230771545320749)), .Names = c("A", "B", "x1", "x2"), row.names = c(NA,
-6L), class = "data.frame"), df3 = structure(list(A = c(1,
1, 2, 2, 3, 3), B = c("Q", "Q", "Q", "P", "P", "P"), x1 = c(0.45840139570646,
0.0418491987511516, 0.798411589581519, 0.898478724062443,
0.064307059859857, 0.174364002654329), x2 = c(0.676136856665835,
0.494200984947383, 0.534940708894283, 0.220597118837759,
0.480761741055176, 0.0230771545320749)), .Names = c("A",
"B", "x1", "x2"), row.names = c(NA, -6L), class = "data.frame")), .Names = c("df1",
"df2", "df3"))
For cbind-ing the dataframes you can do:
L <- mylist[[1]]
for (i in 2:length(mylist)) L <- cbind(L, mylist[[i]][-(1:2)])
For merging (as in the previously shown, but wrong, expected output for the example):
L <- mylist[[1]]
for (i in 2:length(mylist)) L <- merge(L, mylist[[i]], by=c("A", "B"))
In the case of merging, I suppose the memory demand comes from the m:n matches among the data frames: the key pairs (A, B) are not unique, so every merge multiplies the matching rows. Another merging procedure will not solve this.
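The row multiplication is easy to see on a toy example. The sketch below uses small made-up data frames with the same duplicated (A, B) keys as the question's list; make_df is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: a 6-row frame whose (A, B) keys repeat,
# mirroring the key layout of the question's toy list
make_df <- function() {
  data.frame(A = c(1, 1, 2, 2, 3, 3),
             B = c("Q", "Q", "Q", "P", "P", "P"),
             x1 = seq_len(6),
             stringsAsFactors = FALSE)
}
d1 <- make_df()
d2 <- make_df()
d3 <- make_df()

# Keys (1, Q) and (3, P) occur twice, so each contributes 2 x 2 rows
m2 <- merge(d1, d2, by = c("A", "B"))
print(nrow(m2))  # 10

# A third merge doubles those groups again: 8 + 1 + 1 + 8
m3 <- merge(m2, d3, by = c("A", "B"))
print(nrow(m3))  # 18
```

With 16,000 rows and 15 data frames, repeated keys can inflate the result far beyond the original size, which is consistent with the memory error in the question.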
Based on the comment stating you want a 16,000 x 62 data.frame...
First cbind the non ID columns:
tmp <- do.call(cbind, lapply(mylist, function(x) x[,-(1:2)]))
Then add "A" and "B"
final <- cbind(mylist[[1]][,1:2], tmp)
No merging needed, just slap the data.frames together
> final
A B df1.x1 df1.x2 df2.x1 df2.x2 df3.x1 df3.x2
1 1 Q 0.45840140 0.67613686 0.45840140 0.67613686 0.45840140 0.67613686
2 1 Q 0.04184920 0.49420098 0.04184920 0.49420098 0.04184920 0.49420098
3 2 Q 0.79841159 0.53494071 0.79841159 0.53494071 0.79841159 0.53494071
4 2 P 0.89847872 0.22059712 0.89847872 0.22059712 0.89847872 0.22059712
5 3 P 0.06430706 0.48076174 0.06430706 0.48076174 0.06430706 0.48076174
6 3 P 0.17436400 0.02307715 0.17436400 0.02307715 0.17436400 0.02307715
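Binding columns side by side is only valid if every data frame lists the same ids in the same row order. A minimal sketch of that check, followed by the bind, is given below; the toy list and the names keys_match and final are assumptions made for the illustration.

```r
# Toy list with identical id columns and two value columns each
ids <- data.frame(A = c(1, 1, 2, 2, 3, 3),
                  B = c("Q", "Q", "Q", "P", "P", "P"),
                  stringsAsFactors = FALSE)
mylist <- list(df1 = cbind(ids, x1 = 1:6,   x2 = 7:12),
               df2 = cbind(ids, x1 = 13:18, x2 = 19:24),
               df3 = cbind(ids, x1 = 25:30, x2 = 31:36))

# Check that every data frame has the same ids in the same order
ref <- mylist[[1]][c("A", "B")]
keys_match <- all(vapply(mylist[-1],
                         function(x) identical(x[c("A", "B")], ref),
                         logical(1)))
stopifnot(keys_match)

# Bind the value columns, then put the ids in front
tmp <- do.call(cbind, lapply(mylist, function(x) x[-(1:2)]))
final <- cbind(ref, tmp)
print(dim(final))  # 6 8
```

If the check fails, the rows would need to be sorted by the id columns first, or a real merge would be required after all.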
