Dynamically accessing a dataframe from a variable [closed] - r

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 3 years ago.
All, I'm trying to access data frames from the content of a variable, so the process can be automated in R.
Let's say I have 10 data frames with unordered names, containing item numbers. I'm trying to merge these data frames one by one with a purchase record, matched by the item primary key. This is straightforward for one or a few data frames, but it's really cumbersome for a large number of data frames.
dfs <- c("Chocolate", "Gum", "Cookies", "PotatoChips", "HotSauce", "Bread", "Yogurt", "Shampoo", "BodyWash", "ShoePolish")
for (i in 1:length(dfs)) {
assign(paste("trx_",dfs[i],sep=""), merge(get(dfs[i]),trx,by="item_no")) }
So, I want to automatically create data frames, e.g. trx_Chocolate, trx_Gum, containing the merged records rather than doing it one by one. The issue is with the merge as it produces an error message about me not having a valid column name - presumably due to dynamically addressing the data frames through the content of a list variable.
I know that a possible workaround is to store the data frames as .csv files, then read them back one by one and merge them that way. However, I'm trying not to create excessive intermediary files if I can help it.
Any advice or help would be much appreciated.
Thank you.

In trying to answer your question, I created a reproducible example. (In the future, I would recommend you include a reprex.)
Your code actually appears to work just fine. See the example below.
As a next step, I would confirm that each of the data.frames whose names are in the vector dfs actually has the column "item_no". Also confirm that trx has this column. Otherwise, this error does not make sense.
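One way to run that check is sketched below. The toy data frames (including the deliberately misnamed key in Gum) are made up for illustration and are not the asker's data:

```r
# Hypothetical toy data: Gum deliberately uses the wrong key column name
trx       <- data.frame(item_no = c("a", "b"))
Chocolate <- data.frame(item_no = "a", col1 = 1)
Gum       <- data.frame(item_num = "b", col1 = 2)

dfs <- c("Chocolate", "Gum")

# For each named data frame, test whether it has an "item_no" column
has_key <- sapply(mget(dfs), function(d) "item_no" %in% names(d))

# Names of the data frames that would make merge() fail
missing_key <- names(has_key)[!has_key]
missing_key
```

Any name that shows up in missing_key points at a data frame whose key column needs to be renamed before merging.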
I would also encourage you to explore options where you do not create different data.frames in the first place. Dynamically referencing/assigning data.frames can cause unexpected challenges -- and makes your code less readable.
You can potentially keep everything in the same, long data.frame and subset out just the items that you need when automating the process. At first glance, this might seem tricky but if possible it might well simplify a lot of the issues you are encountering.
If you need additional assistance please consider posting a reproducible example that further illustrates the issues you are having.
dfs <- c("Chocolate", "Gum", "Cookies", "PotatoChips", "HotSauce", "Bread", "Yogurt", "Shampoo", "BodyWash", "ShoePolish")
for (i in 1:length(dfs)) {
assign(paste("trx_",dfs[i],sep=""), merge(get(dfs[i]),trx,by="item_no")) }

I have created a reproducible example, and your code works fine.
First create some dummy data:
trx <- data.frame('item_no' = paste0('item_',1:10))
Chocolate <- data.frame('item_no' = paste0('item_',1:5), 'col1' = 1:5)
Cookies <- data.frame('item_no' = paste0('item_',5:7), 'col1' = 1)
Run your code:
dfs <- c('Chocolate', 'Cookies')
for (i in 1:length(dfs)) {
assign(paste0('trx_',dfs[i]), merge(get(dfs[i]), trx, by="item_no")) }
View output:
> trx_Chocolate
item_no col1
1 item_1 1
2 item_2 2
3 item_3 3
4 item_4 4
5 item_5 5
> trx_Cookies
item_no col1
1 item_5 1
2 item_6 1
3 item_7 1
If you do not have item_no in both the data frames you are trying to merge, you will receive the error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column.
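As a sketch of the alternative suggested in the other answer (avoiding assign() and separate trx_* variables), the merged results can be kept in one named list. This reuses the dummy data from above; the names items and merged are made up for illustration:

```r
# Dummy data, as in the example above
trx       <- data.frame(item_no = paste0("item_", 1:10))
Chocolate <- data.frame(item_no = paste0("item_", 1:5), col1 = 1:5)
Cookies   <- data.frame(item_no = paste0("item_", 5:7), col1 = 1)

dfs <- c("Chocolate", "Cookies")

# Collect the data frames into a named list, then merge each with trx
items  <- mget(dfs)
merged <- lapply(items, function(d) merge(d, trx, by = "item_no"))
names(merged) <- paste0("trx_", dfs)

merged$trx_Cookies
```

Each result is then available as merged$trx_Chocolate, merged$trx_Cookies, and so on, and the whole set can be processed further with lapply().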

Related

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new in R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
As an example, let's use this data from a national survey that includes a questionnaire.
If you download the .csv file to your working directory, you can read it with:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it helps you to provide meaningful names (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect
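Applied to the layout described in the question (an ID column followed by item.1 to item.20, answers from 0 to 3), the same idea might look like the sketch below. The data are randomly generated, and the choice of items 1, 3 and 5 for the subscore is an arbitrary assumption:

```r
# Made-up questionnaire data: 5 participants, 20 items scored 0 to 3
set.seed(1)
data <- data.frame(id = 1:5,
                   matrix(sample(0:3, 100, replace = TRUE), nrow = 5))
names(data)[-1] <- paste0("item.", 1:20)

# Build the column names instead of typing them one by one
all_items <- paste0("item.", 1:20)
sub_items <- paste0("item.", c(1, 3, 5))  # arbitrary subset for illustration

data$total.score <- rowSums(data[ , all_items])
data$subscore    <- rowSums(data[ , sub_items])

head(data[ , c("id", "total.score", "subscore")])
```

Selecting by name rather than by position keeps the sums correct even if columns are later added or reordered.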

How to filter column for same name but different number using dplyr [duplicate]

This question already has answers here:
Filter rows which contain a certain string
(5 answers)
Selecting data frame rows based on partial string match in a column
(4 answers)
Closed 4 years ago.
I apologize in advance if the title of this post isn't accurate. I already know that this is a super easy question and if I knew the correct terminology I probably could find a previous post about this.
So what I am attempting to do is filter my data using dplyr for a gene family. Here is an example so it makes a bit more sense.
I have a gene family called ADCY, but that family comprises 10 separate genes. So the family looks like this
ADCY1
ADCY2
ADCY3
ADCY4
ADCY5
ADCY6
ADCY7
ADCY8
ADCY9
ADCY10
I know I can do something like this but it is kind of annoying to have to type out all 10 genes, especially when I have a bunch of other gene families I want to look at.
genes <- c("ADCY1", "ADCY2", "ADCY3", "ADCY4", "ADCY5", "ADCY6", "ADCY7",
"ADCY8", "ADCY9", "ADCY10")
df_filtered <- df %>%
filter(symbol %in% genes)
I was wondering if there was a way to use dplyr and filter for just the start of the gene name, if that makes sense. I know there is a starts_with("ADCY") that I can use, but my R session crashes when I try to use that with the filter option. I was wondering if anyone had some solutions!
You can use the good old (I mean no dependency required) paste0:
paste0("ADCY", 1:10)
[1] "ADCY1" "ADCY2" "ADCY3" "ADCY4" "ADCY5" "ADCY6" "ADCY7" "ADCY8" "ADCY9"
[10] "ADCY10"
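To filter on the start of the name directly, a base R sketch is shown below. starts_with() is a column-selection helper for select(), not a row filter, whereas base R's startsWith() and grepl() work on the values of a column. The toy df and its value column are made up for illustration:

```r
# Hypothetical toy data; ADCYAP1 shares the prefix but is not ADCY1 to ADCY10
df <- data.frame(symbol = c("ADCY1", "ADCY10", "ADCYAP1", "TP53"),
                 value  = 1:4,
                 stringsAsFactors = FALSE)

# Plain prefix match: also keeps ADCYAP1
keep_prefix <- startsWith(df$symbol, "ADCY")

# Stricter match: the prefix followed only by digits
keep_family <- grepl("^ADCY[0-9]+$", df$symbol)

df_filtered <- df[keep_family, ]
df_filtered
```

The same logical expressions should also work inside dplyr, e.g. filter(df, grepl("^ADCY[0-9]+$", symbol)).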

In R, It comes out as a number that tries to extract characters that match the condition using the ifelse syntax [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
I have data on a total of 100 basketball games played by six teams. I wrote the following R code to see which team wins each game.
win = ifelse(dat$away_score > dat$home_score, dat$away, dat$home)
However, the output is not the name of the basketball team but a number (1, 2, 3, ...).
It looks as if the teams were sorted alphabetically and then numbered in that order. How do I get the results as the original team names rather than numbers?
It seems the columns are factors. If we convert the factors to character class, it works:
ifelse(dat$away_score > dat$home_score, as.character(dat$away), as.character(dat$home))
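A minimal sketch of why this happens, assuming the team columns really are factors (the toy dat below is made up): ifelse() drops the factor attributes, so only the underlying integer codes are left.

```r
# Made-up data with the team columns stored as factors
dat <- data.frame(home = factor(c("a", "b", "c")),
                  away = factor(c("d", "e", "f")),
                  away_score = c(90, 80, 70),
                  home_score = c(89, 81, 69))

# With factors, ifelse() returns the integer level codes
win_codes <- ifelse(dat$away_score > dat$home_score, dat$away, dat$home)

# Converting to character first keeps the team names
win_names <- ifelse(dat$away_score > dat$home_score,
                    as.character(dat$away), as.character(dat$home))
win_names
```

Running class(dat$home) on the real data shows whether the columns are factors in the first place.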
Not sure what dat looks like, but if I do this:
dat <- c()
dat$home <- c("a","b","c") # home team names
dat$away <- c("d","e","f") # away team names
dat$away_score <- c(90,80,70)
dat$home_score <- c(89,81,69)
win = ifelse(dat$away_score > dat$home_score, dat$away, dat$home)
win # print results
I get the following showing the "name" of which team won:
[1] "d" "b" "f"

Custom sort without using factor [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
I have a dataframe with a column named Stage. The dataframe is generated from a regularly updated excel file.
This column should only have a certain few values in it, such as 'Planning', or 'Analysis', but people occasionally put custom values in and it is impractical to stop.
I want the dataframe sorted by this column, with a custom sort order that makes sense chronologically (e.g. for us, planning comes before analysis). I would be able to implement this using factors (e.g. Reorder rows using custom order), but if I use a predefined list of factor levels, I lose any unexpected values that people enter into that column. I am happy for the unexpected values not to be sorted properly but I don't want to lose them entirely.
EDIT: Answer by floo0 is amazing, but I neglected to mention that I was planning on barplotting the results, something like
barplot(table(MESH_assurance_involved()[MESH_assurance_involved_sort_order(), 'Stage']), main="Stage became involved")
(parentheses because these are shiny reactive objects, shouldn't make a difference).
The results are unsorted, although testing in the console reveals the underlying data is sorted.
table is also breaking the sorting but using ggplot and no table I get the identical result.
To display a barplot maintaining the source order seems to require something like Ordering bars in barplot() but all solutions I have found require factors, and mixing them with the solution here is not working for me somehow.
Toy data-set:
dat <- data.frame(Stage = c('random1', 'Planning', 'Analysis', 'random2'), id=1:4,
stringsAsFactors = FALSE)
So dat looks as follows:
> dat
Stage id
1 random1 1
2 Planning 2
3 Analysis 3
4 random2 4
Now you can do something like this:
known_levels <- c('Planning', 'Analysis')
my_order <- order(factor(dat$Stage, levels = known_levels, ordered=TRUE))
dat[my_order, ]
Which gives you
Stage id
2 Planning 2
3 Analysis 3
1 random1 1
4 random2 4
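For the barplot mentioned in the edit to the question, one possible approach (a sketch, not part of the answer above) is to build a factor whose levels are the known stages followed by whatever unexpected values occur in the data, so that nothing is dropped and table() keeps the order. The extra 'Planning' row and the name all_levels are made up for illustration:

```r
# Toy data with one repeated known stage
dat <- data.frame(Stage = c("random1", "Planning", "Analysis", "random2", "Planning"),
                  id = 1:5,
                  stringsAsFactors = FALSE)

known_levels <- c("Planning", "Analysis")

# Known stages first, then any unexpected values in order of appearance
all_levels <- c(known_levels, setdiff(unique(dat$Stage), known_levels))
stage_f    <- factor(dat$Stage, levels = all_levels)

# table() follows the level order, so the bars come out in that order too
counts <- table(stage_f)
barplot(counts, main = "Stage became involved")
```

Because every observed value becomes a level, no row turns into NA, and the unexpected values simply follow the known stages.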

Analyze CSV data in R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
I have CSV data as follows:
code, label, value
ABC, len, 10
ABC, count, 20
ABC, data, 102
ABC, data, 212
ABC, data, 443
...
XYZ, len, 11
XYZ, count, 25
XYZ, data, 782
...
The number of data entries is different for each code. (This doesn't matter for my question; I'm just pointing it out.)
I need to analyze the data entries for each code. This would include calculating the median, plotting graphs, etc. This means I should separate out the data for each code and make it numeric?
Is there a better way of doing this than this kind of thing:
x = read.csv('dataFile.csv', header=T)
...
median(as.numeric(subset(x, x$code=='ABC' & x$label=='data')$value))
boxplot(median(as.numeric(subset(x, x$code=='ABC' & x$label=='data')$value)))
split and list2env allow you to separate your data.frame x by code, generating one data.frame for each level of code:
list2env(split(x, x$code), envir=.GlobalEnv)
or just
my.list <- split(x, x$code)
if you prefer to work with lists.
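As a sketch of how the list form can be used for the per-code analysis, the example below takes the median of the data rows for each code. It uses a few rows in the style of the question's sample, read from inline text rather than a file; the names d, by_code and medians are made up for illustration:

```r
# A few rows in the style of the question, read from inline text
x <- read.csv(text = "code,label,value
ABC,len,10
ABC,count,20
ABC,data,102
ABC,data,212
ABC,data,443
XYZ,len,11
XYZ,count,25
XYZ,data,782", strip.white = TRUE)

# Keep only the 'data' rows, then split the values by code
d       <- x[x$label == "data", ]
by_code <- split(d$value, d$code)

# One median per code
medians <- sapply(by_code, median)
medians
```

The same list can be passed to other functions, for example boxplot(by_code) to draw one box per code.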
I'm not sure I totally understand the final objective of your question. Do you just want some pointers on what you could do? There are a lot of possible solutions.
When you ask: I need to analyze the data entries for each code. This would include calculating the median, plotting graphs, etc. This means I should separate out the data for each code and make it numeric?
The answer would be no, you don't strictly have to. You could use R functions that do this task for you, for example:
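# strip.white=TRUE removes the spaces after the commas in the file
x = read.csv('dataFile.csv', header=TRUE, strip.white=TRUE)
# is it numeric?
class(x$value)
# If it is already numeric you don't have to convert it.
# A purely numeric column is normally read as numeric, but a stray
# non-numeric entry makes read.csv read the whole column as strings.
# Keep only the 'data' rows, then take the median of value per code
d <- subset(x, label == 'data')
aggregate(value ~ code, data=d, FUN=median)
boxplot(value ~ code, data=d)
# and you can do ?boxplot to look into its options.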
