Rescaling multiple columns of dataframe between specific ranges in R - r

I want to rescale multiple columns of a data frame to a specific range (1 to 20) in R/RStudio. While I can get it to work for a single column, I can't seem to get it to work for multiple columns. The real data contains many columns, so some sort of indexing would be ideal if possible. I'm sure it's probably something simple, but I can't figure out what I'm missing. Any help would be appreciated. Thanks
# This works on a single column
library(scales)
single = c(100,90,80,70,60,50,40,30,20,10)
rescale(single , to=c(1,20))
# This does not work
library(scales)
multiple = data.frame(V0 = c("A","B","C","D","E","F", "G", "H", "I", "J"),
V1= c(1,2,3,4,5,6,7,8,9,10),
V2= c(100,90,80,70,60,50,40,30,20,10)
)
rescale(multiple[,c(2,3)], to=c(1,20))

You are looking for:
multiple[,c(2,3)] <-lapply(multiple[,c(2,3)], rescale, to=c(1,20))
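The idea behind this answer is that the rescaling function works on one vector at a time, so it has to be applied column by column and the result assigned back into the same columns. Below is a minimal base-R sketch of that pattern. The helper `rescale_to` is hypothetical (a plain linear min-max rescale standing in for `scales::rescale`), and selecting columns with `is.numeric` is an assumption that suits data with many numeric columns.

```r
# Hypothetical stand-in for scales::rescale(): linear min-max rescaling
# of a numeric vector onto the interval given by `to`.
rescale_to <- function(x, to = c(0, 1)) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

multiple <- data.frame(
  V0 = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  V1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  V2 = c(100, 90, 80, 70, 60, 50, 40, 30, 20, 10)
)

# Pick every numeric column instead of hard-coding positions 2 and 3.
num_cols <- sapply(multiple, is.numeric)

# lapply() rescales each selected column separately; assigning the
# resulting list back replaces those columns and leaves V0 untouched.
multiple[num_cols] <- lapply(multiple[num_cols], rescale_to, to = c(1, 20))
```

Each column is rescaled against its own minimum and maximum, so V1 and V2 both end up spanning 1 to 20.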

Related

Read specific columns starting from certain rows from excel file using readxl package in R

I'm trying to read an excel file into R. I need to read column A and column C (no B), starting from row 5. Here is what I did:
library(readxl)
read_excel('./data/temp.xlsx', skip=5,
range=cell_cols(c('A', 'C')))
The code above does not work. First, it does not skip 5 rows; it reads from the first row. Second, it also reads column B, which I do not want.
Does anyone know what I did wrong? I know how to specify the cell range, but how should I pick the specific columns I need?
You can use the col_types argument (check ?read_excel) to skip columns from being read. For instance, if columns A and C are numeric:
readxl::read_excel("/path/to/data.xlsx",
col_names = FALSE,
skip = 5,
col_types=c("numeric", "skip", "numeric"))
NB: if the column types are unknown initially you could read them as text and convert them afterwards.
Borrowing from readxl.tidyverse.org, the reason column B is also read is that a column range is defined by its first and last column:
## columns only
read_excel(..., range = cell_cols(1:26))
## is equivalent to all of these
read_excel(..., range = cell_cols(c(1, 26)))
read_excel(..., range = cell_cols("A:Z"))
read_excel(..., range = cell_cols(LETTERS))
read_excel(..., range = cell_cols(c("A", "Z")))
Hence, cell_cols("A:C") is equivalent to cell_cols(c("A", "C"))
Previously, in one of my projects, I did the following. I guess you can adapt it to extract the data column by column and then join the pieces together.
library(readxl)
library(dplyr)  # provides %>% and full_join()
library(purrr)  # provides reduce()
# filepath, cname and ctype are defined elsewhere: file path, column names, column types
ranges = list("A5:H18", "A28:H39", "A50:H61")
extracted <- lapply(ranges, function(each_range){
read_excel(filepath, sheet = 1, range = each_range, na = c("", "-"), col_names = cname, col_types = ctype)
}) %>%
reduce(full_join)
Regarding your question about skipping rows, I'm also not sure because I was also searching for this answer, and found your question on stackoverflow.
[edit] I think I found some readings on https://github.com/tidyverse/readxl/issues/577. Anyway, if you use range, you can't do any skip, as range takes precedence over skip and others

order function only partially reordering dataframe

I have created a data frame using rbind() to append two data frames with the same row names together. I am then trying to use the order() function to order the factor levels alphabetically. However, it is still treating the data frames as two separate objects, and ordering the first alphabetically, and then the second alphabetically separately.
Example:
df1 <- data.frame(site=c("A", "F", "C"))
df2 <- data.frame(site=c("B", "G", "D"))
new.df <- rbind(df1, df2)
new.df <- new.df[order(new.df$site),]
outcome:
site
A
C
F
B
D
G
I have looked at other methods of reordering data, for example using the arrange function from package dplyr, but have not had any success. Any suggestions of how to fix this?
Any help much appreciated.
Thanks
Avoid creation of factors by
df1 <- data.frame(site=c("A", "F", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(site=c("B", "G", "D"), stringsAsFactors = FALSE)
then the remaining stuff will work as expected.
I'm guessing you're not doing quite what you think you're doing there: the resulting new.df isn't a data frame any more, it's a factor. The result of order is to put it in the order of the levels of the factor (see levels(new.df$site)). So, if you really want to do it this way (i.e., keeping it as a factor rather than a character vector), you will need to reorder the levels first.
new.df$site <- factor(new.df$site, levels = sort(levels(new.df$site)))
new.df[order(new.df$site), ]
[1] A B C D F G
Levels: A B C D F G
But unless you really need it to be a factor from the start, I think you would be best advised to do what @Uwe Block suggests and, if necessary, turn it into a factor after you've used rbind and done the sorting.
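To make the two points above concrete, here is a small sketch that reproduces the partial ordering and then sorts on the labels instead of the level codes. `stringsAsFactors = TRUE` is set explicitly so the behaviour is the same regardless of R version (the default changed to FALSE in R 4.0.0). Using `drop = FALSE` to keep a one-column data frame is an addition for illustration, not something either answer proposes.

```r
# Build the two data frames with factor columns, as in the question.
df1 <- data.frame(site = c("A", "F", "C"), stringsAsFactors = TRUE)
df2 <- data.frame(site = c("B", "G", "D"), stringsAsFactors = TRUE)
new.df <- rbind(df1, df2)

# rbind() appends the levels of df2 after those of df1,
# so the levels are not in alphabetical order.
lvl <- levels(new.df$site)

# order() on a factor sorts by level position, which gives the
# "partially sorted" result from the question.
by_level <- as.character(new.df$site[order(new.df$site)])

# Sorting on the labels gives true alphabetical order;
# drop = FALSE keeps the result a data frame instead of a bare factor.
sorted <- new.df[order(as.character(new.df$site)), , drop = FALSE]
```

Inspecting `lvl` shows why the original call appeared to sort the two halves separately.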

Setting column classes and defining any factor levels, without using loops in R

I have been grappling with the following problem for a while, as I need to load in, manipulate, and produce scores from new datasets as quickly as possible. I have defined a data dictionary containing a description of each variable class (e.g. numeric, factor, character, date) and, where applicable, a list of all possible factor levels:
DD <- data.frame(Var = c("a", "b", "c", "d"),
Class = c("Numeric", "Factor", "Factor", "Date"),
Levels = c(NA, "B1, B2, B3", "C1, C2", NA))
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01")
Ultimately, I intend to use model.matrix to produce a design matrix with a common set of indicator variables/ columns regardless of the actual factor levels observed in the particular dataset, so I can score up the data from a particular model.
I need to do these tasks as quickly as possible and, hence, I am trying to find a solution that avoids using lapply/ loops. Here is (a slightly convoluted version of) my existing solution for setting the factor levels, which is currently too slow for my requirements:
lapply(1:ncol(Data[,DD$Class=="Factor"]), function(i) {
factor( as.character( unlist( Data[,DD$Class=="Factor"][i])) ,
levels = unlist(strsplit(as.character(DD$Levels[DD$Class=="Factor"][i]), ", ")) )
})
Any suggestions for avoiding use of a loop here, if it is even possible, or any alternative solutions would be much appreciated!
Thanks!
Sorry that I don't have enough reputation to add this as a comment.
Can I ask:
1. What are the dimensions of your dataset?
2. What running time would be acceptable?
You could consider using Microsoft R Open (previously Revolution R), which optimises basic data manipulation.
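For reference, the factor-level step from the question can be written more directly by splitting the level strings once and pairing each factor column with its levels. This is only a sketch under the question's example data; `Map` still iterates over the columns internally, so it tidies the code rather than removing the loop, and no speed-up is claimed. The intermediate names `is_fac`, `fac_vars` and `lvls` are introduced here for illustration.

```r
# Data dictionary and data from the question, kept as character columns.
DD <- data.frame(Var = c("a", "b", "c", "d"),
                 Class = c("Numeric", "Factor", "Factor", "Date"),
                 Levels = c(NA, "B1, B2, B3", "C1, C2", NA),
                 stringsAsFactors = FALSE)
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01",
                   stringsAsFactors = FALSE)

# Names of the columns the dictionary declares as factors.
is_fac <- DD$Class == "Factor"
fac_vars <- DD$Var[is_fac]

# Split every level string in a single vectorised call.
lvls <- strsplit(DD$Levels[is_fac], ", ", fixed = TRUE)

# Pair each factor column with its full set of levels.
Data[fac_vars] <- Map(factor, Data[fac_vars], levels = lvls)
```

After this, each factor column carries all levels listed in the dictionary, including ones that do not occur in the data, which is what a consistent design matrix needs.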

R - show only levels used in a subset of data frame

I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is subsetrows <- which(is.na(mydata$reference)) but after that I'm stuck. I want something like levels(mydata[subsetrows,mydata$factor]) but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:
library(dplyr)
d %>%
filter(is.na(ref)) %>%
select(field) %>%
distinct()
data
d <- data.frame(
field = c("A", "B", "C", "A", "B", "C"),
ref = c(NA, "a", "b", NA, "c", NA)
)
I modified a suggestion in the comments by Marat to use the function unique that seems to return the correct levels.
Solution:
subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))
While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
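The question also mentions dropping unused levels. A base-R sketch of that route is shown below: `droplevels()` is applied to the subsetted factor only, so nothing is modified or copied back into the data frame. The example data are made up for illustration, reusing the column names `factor` and `reference` from the question.

```r
# Made-up example data using the question's column names.
mydata <- data.frame(
  factor = factor(c("P", "Q", "R", "Y", "P", "Z")),
  reference = c(NA, "a", NA, NA, "b", "c")
)

# Subset the factor to rows where reference is NA, drop the levels
# that no longer occur, and read off the remaining levels.
used <- levels(droplevels(mydata$factor[is.na(mydata$reference)]))
```

Unlike `unique(as.character(...))`, this returns the values in level order rather than in order of first appearance; the original column keeps all of its levels.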

How do I generate a boxplot using the original data order (not alphabetical)?

I am new to R. I've made a boxplot of my data but currently R is sorting the factors alphabetically. How do I maintain the original order of my data? This is my code:
boxplot(MS~Code,data=Input)
I have 40 variables that I wish to boxplot in the same order as the original data frame lists them. I've read that I may be able to set sort.names=FALSE to maintain the original order, but I don't understand where that piece of code would go.
Is there a way to redefine my Input before it goes into boxplot?
Thank you.
Re-create the factor with the levels in the order you want (line 3 below):
data(InsectSprays)
data <- InsectSprays
data$spray <- factor(data$spray, c("B", "C", "D", "E", "F", "A"))
boxplot(count ~ spray, data = data, col = "lightgray")
The answer above is 98% of the way there.
set.seed(1)
# original order is E - A
Input <- data.frame(Code=rep(rev(LETTERS[1:5]),each=5),
MS=rnorm(25,sample(1:5,5)))
boxplot(MS~Code,data=Input) # plots alphabetically
Input$Code <- with(Input,factor(Code,levels=unique(Code)))
boxplot(MS~Code,data=Input) # plots in original order
