I have a dataframe with a column named Stage, generated from a regularly updated Excel file.
This column should only contain a certain few values, such as 'Planning' or 'Analysis', but people occasionally enter custom values, and it is impractical to stop them.
I want the dataframe sorted by this column in a custom order that makes sense chronologically (e.g. for us, Planning comes before Analysis). I could implement this using factors (e.g. Reorder rows using custom order), but if I use a predefined list of factor levels, I lose any unexpected values that people enter into that column. I am happy for the unexpected values not to be sorted properly, but I don't want to lose them entirely.
EDIT: The answer by floo0 is great, but I neglected to mention that I plan to barplot the results, something like
barplot(table(MESH_assurance_involved()[MESH_assurance_involved_sort_order(), 'Stage']), main="Stage became involved")
(the parentheses are because these are Shiny reactive objects; this shouldn't make a difference).
The bars come out unsorted, although testing in the console shows the underlying data is sorted.
I suspected table was breaking the sorting, but using ggplot without table I get the identical result.
Displaying a barplot that keeps the source order seems to require something like Ordering bars in barplot(), but all the solutions I have found require factors, and mixing them with the solution here is somehow not working for me.
Toy data-set:
dat <- data.frame(Stage = c('random1', 'Planning', 'Analysis', 'random2'),
                  id = 1:4,
                  stringsAsFactors = FALSE)
So dat looks as follows:
> dat
Stage id
1 random1 1
2 Planning 2
3 Analysis 3
4 random2 4
Now you can do something like this:
known_levels <- c('Planning', 'Analysis')
my_order <- order(factor(dat$Stage, levels = known_levels, ordered=TRUE))
dat[my_order, ]
Which gives you
Stage id
2 Planning 2
3 Analysis 3
1 random1 1
4 random2 4
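To address the barplot problem from the EDIT, one possible sketch (using the toy data above; this is one approach, not necessarily the only one) is to append any unexpected values after the known levels when building the factor. That way table() and barplot() follow the custom order without dropping anything:

```r
dat <- data.frame(Stage = c('random1', 'Planning', 'Analysis', 'random2'),
                  id = 1:4, stringsAsFactors = FALSE)

known_levels <- c('Planning', 'Analysis')

# Keep unexpected values by appending them after the known levels;
# they sort last, in no particular order
all_levels <- union(known_levels, unique(dat$Stage))
dat$Stage <- factor(dat$Stage, levels = all_levels)

dat[order(dat$Stage), ]    # rows sorted as before
barplot(table(dat$Stage))  # bars follow the same custom order
```

Because dat$Stage is now a factor whose levels encode the desired order, table() preserves that order and barplot() inherits it.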
All, I'm trying to access data frames by name from the contents of a variable, so the process can be automated in R.
Let's say I have 10 data frames with unordered names, containing item numbers. I'm trying to merge each of these data frames with a purchase record, matched on the item primary key. This is straightforward for one or a few data frames, but really cumbersome for a large number of them.
dfs <- c("Chocolate", "Gum", "Cookies", "PotatoChips", "HotSauce", "Bread", "Yogurt", "Shampoo", "BodyWash", "ShoePolish")
for (i in 1:length(dfs)) {
  assign(paste("trx_", dfs[i], sep = ""), merge(get(dfs[i]), trx, by = "item_no"))
}
So, I want to automatically create data frames, e.g. trx_Chocolate and trx_Gum, containing the merged records, rather than doing it one by one. The issue is with the merge: it produces an error message saying I don't have a valid column name, presumably because I am addressing the data frames dynamically through the contents of a character vector.
I know a possible workaround is to store the data frames as .csv files, read them back one by one, and merge them that way. However, I'm trying not to create excessive intermediary files if I can help it.
Any advice or help would be much appreciated.
Thank you.
In trying to answer your question, I created a reproducible example. (In the future, I would recommend you include a reprex.)
Your code actually appears to work just fine. See the example below.
As a next step, I would confirm that each of the data.frames whose names are in the vector dfs actually has the column "item_no", and that trx has this column as well. Otherwise, this error does not make sense.
I would also encourage you to explore options where you do not create different data.frames in the first place. Dynamically referencing/assigning data.frames can cause unexpected challenges -- and makes your code less readable.
You can potentially keep everything in the same, long data.frame and subset out just the items that you need when automating the process. At first glance, this might seem tricky but if possible it might well simplify a lot of the issues you are encountering.
If you need additional assistance please consider posting a reproducible example that further illustrates the issues you are having.
I have created a reproducible example, and your code works fine.
First create some dummy data:
trx <- data.frame('item_no' = paste0('item_',1:10))
Chocolate <- data.frame('item_no' = paste0('item_',1:5), 'col1' = 1:5)
Cookies <- data.frame('item_no' = paste0('item_',5:7), 'col1' = 1)
Run your code:
dfs <- c('Chocolate', 'Cookies')
for (i in 1:length(dfs)) {
  assign(paste0('trx_', dfs[i]), merge(get(dfs[i]), trx, by = "item_no"))
}
View output:
> trx_Chocolate
item_no col1
1 item_1 1
2 item_2 2
3 item_3 3
4 item_4 4
5 item_5 5
> trx_Cookies
item_no col1
1 item_5 1
2 item_6 1
3 item_7 1
If you do not have item_no in both the data frames you are trying to merge, you will receive the error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column.
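As a sketch of the list-based approach encouraged above, the same dummy data can be merged without assign()/get() at all by keeping the results in one named list (the name trx_list is made up for illustration):

```r
# Same dummy data as in the example above
trx <- data.frame(item_no = paste0('item_', 1:10))
Chocolate <- data.frame(item_no = paste0('item_', 1:5), col1 = 1:5)
Cookies <- data.frame(item_no = paste0('item_', 5:7), col1 = 1)

dfs <- c('Chocolate', 'Cookies')

# mget() fetches the named data.frames into one named list;
# lapply() merges each against trx
trx_list <- lapply(mget(dfs), function(d) merge(d, trx, by = "item_no"))

trx_list$Chocolate  # same result as trx_Chocolate above
```

This keeps all merged results in a single object, which is easier to loop over later than a collection of dynamically named trx_* data.frames in the global environment.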
I have a total of 100 records from basketball games between six teams. I wrote R code to find which team wins each game, like this:
win = ifelse(dat$away_score > dat$home_score, dat$away, dat$home)
However, the output is not the name of the basketball team but a number (1, 2, 3, ...).
The teams were numbered according to their alphabetical order. How do I print the results with the original team names rather than numbers?
It seems the columns are factors. We can convert them to character class and then it works:
ifelse(dat$away_score > dat$home_score, as.character(dat$away), as.character(dat$home))
Not sure what dat looks like, but if I do this:
dat <- c()
dat$home <- c("a","b","c") # home team names
dat$away <- c("d","e","f") # away team names
dat$away_score <- c(90,80,70)
dat$home_score <- c(89,81,69)
win = ifelse(dat$away_score > dat$home_score, dat$away, dat$home)
win # print results
I get the following showing the "name" of which team won:
[1] "d" "b" "f"
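A sketch of the pitfall described in the first answer: if the name columns are stored as factors (the likely situation in the question), ifelse() returns the underlying integer codes instead of the labels, and as.character() fixes it:

```r
# Hypothetical data with the columns forced to factors,
# as happened by default in R before 4.0
dat <- data.frame(home = c("a", "b", "c"),
                  away = c("d", "e", "f"),
                  home_score = c(89, 81, 69),
                  away_score = c(90, 80, 70),
                  stringsAsFactors = TRUE)

# Factors: ifelse() strips attributes and returns the internal codes
ifelse(dat$away_score > dat$home_score, dat$away, dat$home)

# Characters: ifelse() returns the team names
ifelse(dat$away_score > dat$home_score,
       as.character(dat$away), as.character(dat$home))
```

The first call prints numbers because a factor's values are integer codes with a levels attribute, and ifelse() drops attributes from its result.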
I have a data frame df with multiple factor columns, say column A with factors a, b, c, column B with factors m, f, and so on. Each of these columns has NA's.
How can I fill the NA's with a, b, c and m, f according to their distribution in the column? For example, if I have 50% males and 50% females (for simplicity), I will fill my NA's 50% as males and 50% as females.
Is this a good technique if I have around 550 observations and 41 columns?
The next step will be to resample the data to make the data set bigger and apply ML to it - which function will enlarge this data set to 10,000 observations or more?
Thanks in advance!
You could use the following code (please see a couple of comments below). I created a small data frame to give you a concrete example:
A_ <- c(rep("a", 10), rep("b", 60), rep("c", 30), rep(NA, 200))
A <- data.frame(A_)
names(A) <- c("A")
b <- sample(c("a","b","c"), size = 200, prob = c(10,60,30)/100,replace = TRUE)
A[is.na(A)] <- b
And you can check with
table(A)
Now you should be careful in changing the NA values. First of all, I would check why you have NA's: maybe there is a reason the information is not there. By replacing the NA's with values from a distribution, you automatically assume that the missing data follow the same distribution as the observed data. Is that really so? Also, 550 observations are really not that many to talk about a distribution. Maybe you simply need to ignore the records with NA's?
Regarding your second question: you cannot simply generate new data from your existing data. In some cases (images, for example, which you can tilt, shift, and so on) you can "augment" your data set, but with a data set as small as the one you describe I would not do it.
It all depends on the kind of data you have, but my first impression is that in your case, with 41 factor columns and many NA's, you cannot simply augment your data. Knowing a bit more about your dataset could help us give you more precise help.
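The hard-coded proportions above can be generalized into a small helper that computes the proportions from the observed values themselves (a sketch; the name fill_na_by_dist is made up for illustration, and it assumes character columns):

```r
fill_na_by_dist <- function(x) {
  # table() counts the observed values, excluding NAs by default;
  # those counts serve directly as sampling weights
  tab <- table(x)
  n_missing <- sum(is.na(x))
  x[is.na(x)] <- sample(names(tab), size = n_missing,
                        replace = TRUE, prob = as.numeric(tab))
  x
}

set.seed(42)  # for reproducibility of the random fill
A <- c(rep("a", 10), rep("b", 60), rep("c", 30), rep(NA, 200))
filled <- fill_na_by_dist(A)
table(filled)  # roughly the observed 10/60/30 split, scaled to all 300 values
```

To apply it over many columns, something like df[] <- lapply(df, fill_na_by_dist) could work, though factor columns would need converting to character first (the same caveats about imputing from the observed distribution still apply).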
I have a dataframe of the correlations between 45 variables, and have added the random forest importance value given to each by the 'varImp' function (I ran a random forest training model with this data).
I would like to run through each column, and wherever a variable has a correlation over .8 (in absolute terms), remove either that row variable or that column variable, whichever has the lower 'varImp' importance. I would also like to remove the same variable from the column/row (since it's a correlation matrix, all variables show up in both a row and a column).
For example, roll_belt and max_picth_belt have a correlation of ~.97, and because roll_belt has a value of 3.77 compared to max_picth_belt's 3.16, I would like to delete max_pitch_belt both as a row, and as a column.
Thanks for your help!
I'm sure there must be a more straightforward way; still, my code does the job.
Assume we've loaded your dataset into an object called df (I do not include the code to get your data, as it is not relevant).
First, it's handy to split off the value column that is used for testing feature importance. The new object, test.value, is the 46th column:
test.value <- df$value
df <- df[, -ncol(df)] # remove the last column from the dataset
Now we are ready to start.
The framework. We need to identify the numbers of the rows/columns to remove from the dataset. So we will:
go column by column
identify the positions of all correlations bigger than 0.8 (in absolute terms)
compare feature importance one by one in a nested loop
record the row/column numbers that should be removed in an object called remove
finally, remove the chosen rows/columns
The code is:
remove <- c() # a vector to store features to be removed
for (i in 1:ncol(df)) {
  coli <- df[, i]                          # pick up the i-th column
  highcori <- abs(coli) > .8 & coli != 1   # correlations > 0.8 in absolute terms
  # go further only if there are such correlations
  if (sum(highcori, na.rm = TRUE) > 0) {
    posi <- which(highcori)                # positions of the high correlations
    # compare feature importance one by one
    for (k in 1:length(posi)) {
      remi <- ifelse(test.value[i] > test.value[posi[k]], posi[k], i)
      remove <- c(remove, remi)            # store the less valued feature
    }
  }
}
remove <- sort(unique(remove))   # keep only unique entries
df.clean <- df[-remove, -remove] # finally, clean the dataset
That's it.
UPDATE
For those who can provide a better solution, here is the data in an easily readable form: cor.remove.RData
OR
if you prefer dput
dput.df.txt
dput.test.value.txt
I would be interested to see a better way of solving the task.
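As one possible tighter variant, the nested loop can be collapsed with which(..., arr.ind = TRUE) on the upper triangle of the matrix. The sketch below uses a made-up 3x3 correlation matrix and importance vector (f1/f2/f3 are hypothetical names, not the question's data):

```r
# Hypothetical data: cor(f1, f2) = 0.9, and f1 is more important than f2
cm  <- matrix(c(1,  .9, .1,
                .9, 1,  .2,
                .1, .2, 1), nrow = 3,
              dimnames = list(paste0("f", 1:3), paste0("f", 1:3)))
imp <- c(f1 = 3.77, f2 = 3.16, f3 = 1.00)

# Each row of `pairs` is one highly correlated pair (upper triangle only,
# so every pair is counted once and the diagonal is skipped)
pairs <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

# From each pair, mark the member with the lower importance for removal
drop <- unique(apply(pairs, 1, function(p) p[which.min(imp[p])]))

cm.clean <- cm[-drop, -drop]  # f2 removed from both rows and columns
```

Note that if no pair exceeds the cutoff, drop is empty and the final subset should be guarded (e.g. if (length(drop)) cm[-drop, -drop] else cm), since indexing with an empty negative vector selects nothing.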
I have CSV data as follows:
code, label, value
ABC, len, 10
ABC, count, 20
ABC, data, 102
ABC, data, 212
ABC, data, 443
...
XYZ, len, 11
XYZ, count, 25
XYZ, data, 782
...
The number of data entries is different for each code. (This doesn't matter for my question; I'm just pointing it out.)
I need to analyze the data entries for each code. This would include calculating the median, plotting graphs, etc. Does this mean I should separate out the data for each code and convert it to numeric?
Is there a better way of doing this than this kind of thing:
x = read.csv('dataFile.csv', header=T)
...
median(as.numeric(subset(x, x$code=='ABC' & x$label=='data')$value))
boxplot(median(as.numeric(subset(x, x$code=='ABC' & x$label=='data')$value)))
split and list2env allow you to separate your data.frame x by code, generating one data.frame for each level of code:
list2env(split(x, x$code), envir=.GlobalEnv)
or just
my.list <- split(x, x$code)
if you prefer to work with lists.
I'm not sure I totally understand the final objective of your question: do you just want some pointers on what you could do? There are a lot of possible solutions.
You ask: "I need to analyze the data entries for each code. This would include calculating the median, plotting graphs, etc. This means I should separate out the data for each code and make it numeric?"
The answer is no, you don't strictly have to. You can use R functions that do the task for you, for example:
x = read.csv('dataFile.csv', header=T)
# Is it numeric?
class(x$value)
# If it is already numeric you shouldn't have to convert it;
# strictly numeric data shouldn't be read as strings, but it happens.
aggregate(x,by=list(x$code),FUN="median")
boxplot(value~code,data=x)
# and you can do ?boxplot to look into its options.
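As a self-contained check of the aggregate/boxplot approach, the visible rows of the question's CSV can be rebuilt directly (the full file has more rows; the values below are only those shown):

```r
# Recreate the sample rows from the question
x <- data.frame(code  = c(rep("ABC", 5), rep("XYZ", 3)),
                label = c("len", "count", "data", "data", "data",
                          "len", "count", "data"),
                value = c(10, 20, 102, 212, 443, 11, 25, 782))

d <- subset(x, label == "data")   # keep only the 'data' rows

# Median of the data entries, per code, in one call
aggregate(value ~ code, data = d, FUN = median)

boxplot(value ~ code, data = d)   # one box per code
```

The formula interface does the per-code grouping, so there is no need to split the data frame manually or create one object per code.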