I am quite new to R, so excuse my lack of ability. I have tried and failed a fair bit, and would appreciate any input.
I am asked to get rid of the inconsistent use of "." and "," to indicate decimals by multiplying every number in certain columns by some multiple of 10. I have tried to simply multiply using the binary operator *, but it obviously doesn't work, as some columns are factors, which is required in this case.
I have also tried using this code, but I get the error: subscript "Var" can't be "NA"
data %>% mutate_if(is.numeric, ~ . * 1000)
Below is the code I have for my dataset
datat <- c("Starting_year" , "Rank" , "Team" , "Home_total_Games", "Home_Total_Attendance" , "Home_Avg_Attendance" , "Home_capacity" , "Away_Total_Attendance" , "Away_Avg_Attendance" , "Away_Capacity")
names(data) <- datat
Factors assigned
data$Rank <- as.factor(data$Rank)
data$Starting_year <- as.factor(data$Starting_year)
Thanks in advance
I can't embed it, but there is a picture of the data below. I am asked to use a function in dplyr to multiply the columns by 1000 to remove all the "." and ",".
dataset
What is the format of numbers?
If the format is: 1.000.000,5, where . is a thousand separator, while , is a decimal separator, just use gsub:
foo = "1.000.000,5"
bar = gsub("\\.", "", foo) # "1000000,5"
baz = gsub(",", "\\.", bar) # "1000000.5"
as.numeric(baz)
In this case, factor is not a problem because gsub will de-factor the vector.
If you need to multiply the numbers after that, it is not a problem. Transform this into a function (such as convert_decimal) and apply it to columns you want:
data$column = convert_decimal(data$column)
For multiple selected columns (let's call the vector of names selection):
data[selection] = lapply(data[selection], convert_decimal)
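A minimal sketch of what such a helper could look like, assuming the 1.000.000,5 format described above. The name convert_decimal comes from the answer; its body, the toy data frame demo and the column picked in selection are illustrative assumptions, not the asker's real data.

```r
# Hypothetical helper: "." is a thousand separator, "," is the decimal mark
convert_decimal <- function(x) {
  x <- as.character(x)                    # also turns factors into plain strings
  x <- gsub(".", "", x, fixed = TRUE)     # drop every thousand separator
  x <- sub(",", ".", x, fixed = TRUE)     # decimal comma -> decimal point
  as.numeric(x)
}

# Toy data (assumed), with the numbers stored as a factor
demo <- data.frame(Team = c("A", "B"),
                   Home_Avg_Attendance = factor(c("1.234,5", "987,25")),
                   stringsAsFactors = FALSE)

selection <- c("Home_Avg_Attendance")
demo[selection] <- lapply(demo[selection], convert_decimal)
demo$Home_Avg_Attendance
```

Columns that are not listed in selection, such as Team here, are left untouched.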
Related
I'm new to R and want to replace all the special characters in my dataframe.
I've looked it up on Stack Overflow and it partially works. All the special characters are replaced with their plain counterparts, for example ä --> a. The problem I'm encountering is that the data frame doesn't exist anymore.
Function to replace
plain_text = function(x) {
old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
x = apply(x,2,function(x) gsub(old1,new1,x))
}
R script
df1 = plain_text(df)
dataframe
V1
1 c("Foo","Bar","Foo","Bar")
2 c("Foo","Bar","Foo","Bar")
3 c("fixed","fixed","not fixed","fixed")
Don't use apply; it can be (and is here) destructive to a data.frame. In this case, you can use chartr. I caution that your function should take into consideration whether a column is character or not (since fixing letters on a numeric column breaks it).
plain_text = function(x) {
old1 <- "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý"
new1 <- "szyaaaaaaceeeeiiiidnooooouuuuy"
ischr <- sapply(x, is.character)
x[ischr] <- lapply(x[ischr], chartr, old = old1, new = new1)
x
}
The chartr function translates from one set of characters to another (so that old= and new= must be strings with the same number of characters). It is analogous to the shell command tr.
The reason apply is bad is that it converts its arguments to a matrix before doing anything. If the frame is all character, then this does not destroy any data, but it does lose the data.frame structure (perhaps easily re-applied with as.data.frame). A more idiomatic way for a frame is to lapply over its columns (analogous to MARGIN=2 for apply), and it returns a list. (A data.frame is effectively just a special-case list.) If we just ran lapply(x, ...) and reassigned it to x, then x would now be a list; however, by reassigning to specific columns with x[ischr]<- (or all columns using x[]<-, not shown here), then x is still a frame albeit with those columns changed.
Lastly, gsub is not used well there because it is looking for the entire string "šžþàáâãäåçèéêëìíîïðñòóôõöùúûüý", not just one of its characters. What this job needs (I believe) is a one-by-one look at the characters, and replace them in-kind.
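The points above about apply, gsub and chartr can be checked on a small example. This is only a sketch: the frame df_demo and the ASCII mapping "abc" to "xyz" are made up for illustration, standing in for the accented characters.

```r
# Toy frame (hypothetical): one character column, one numeric column
df_demo <- data.frame(name = c("aabbcc", "cab"), n = c(1.5, 2),
                      stringsAsFactors = FALSE)

# apply() goes through as.matrix(), so the result is a character matrix
m <- apply(df_demo, 2, toupper)
is.matrix(m)        # TRUE
is.data.frame(m)    # FALSE

# gsub() searches for the whole string "abc"; chartr() maps a->x, b->y, c->z
gsub("abc", "xyz", df_demo$name)     # "aabbcc" "cab"  (nothing replaced)
chartr("abc", "xyz", df_demo$name)   # "xxyyzz" "zxy"

# Assigning into df_demo[ischr] keeps the data.frame and skips numeric columns
ischr <- sapply(df_demo, is.character)
df_demo[ischr] <- lapply(df_demo[ischr], chartr, old = "abc", new = "xyz")
is.data.frame(df_demo)   # TRUE
```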
I have a column of a data frame that has thousands of complicated sample names like this:
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
I am trying, with no success, to change the sample names to the following:
16.3R1, 16.3R2, 2.3R1, 2.3R2
I am thinking of solving the problem with gsub or stringr.
Any suggestions? I have tried gsub, but it does not return the desired names.
You can use sub to extract the parts :
sample <- c("16_3_S16_R1_001","16_3_S16_R2_001","2_3_S2_R1_001","2_3_S2_R2_001")
sub('(\\d+)_(\\d+)_.*(R\\d+).*', '\\1.\\2\\3', sample)
#[1] "16.3R1" "16.3R2" "2.3R1" "2.3R2"
\\d+ refers to one or more digits. The values captured between () are called capture groups. So here we are capturing one or more digits (1), followed by an underscore and another run of digits (2), and finally "R" followed by one or more digits (3). The captured values are referred to using backreferences, so \\1 is the first value, \\2 the second, and so on.
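To see what each capture group picks up, one option is to pull the groups out with regexec and regmatches instead of substituting. This is a small sketch using the same pattern; the names sample_names and parts are only for illustration.

```r
sample_names <- c("16_3_S16_R1_001", "2_3_S2_R2_001")
pattern <- "(\\d+)_(\\d+)_.*(R\\d+).*"

# Each list element holds the full match first, then groups 1, 2 and 3
parts <- regmatches(sample_names, regexec(pattern, sample_names))
parts[[1]]
# [1] "16_3_S16_R1_001" "16"              "3"               "R1"

# Pasting groups 1-3 back together gives the same result as the sub() call
sapply(parts, function(p) paste0(p[2], ".", p[3], p[4]))
# [1] "16.3R1" "2.3R2"
```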
If you split the string sample into substrings on the pattern "_", you need only the 1st, 2nd and 4th parts:
sample <- c("16_3_S16_R1_001",
"16_3_S16_R2_001",
"2_3_S2_R1_001",
"2_3_S2_R2_001")
x <- strsplit(sample, "_")
sapply(x, function(y) paste0(y[1], ".", y[2], y[4]))
Here is one way you could do it.
It helps to create a data frame with a header column, so that's what I did below; I called the column "cats".
trial <- data.frame( "cats" = character(0))
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
df <- data.frame("cats" = x)
The data needs to be in the right structure, in our case a factor (via as.factor()):
df$cats <- as.factor(df$cats)
levels(df$cats)[levels(df$cats)=="16_3_S16_R1_001"] <- "16.3R1"
levels(df$cats)[levels(df$cats)=="16_3_S16_R2_001"] <- "16.3R2"
levels(df$cats)[levels(df$cats)=="2_3_S2_R1_001"] <- "2.3R1"
levels(df$cats)[levels(df$cats)=="2_3_S2_R2_001"] <- "2.3R2"
And voilà
I have a dataframe df and the first column looks like this:
[1] "760–563" "01455–1" "4672–04" "11–31234" "22–12" "11111–53" "111–21" "17–356239" "14–22352" "531–353"
I want to split that column on -.
What I'm doing is
strsplit(df[,1], "-")
The problem is that it's not working. It returns a list without splitting the elements. I already tried adding the parameter fixed = TRUE and putting a regular expression in the split parameter, but nothing worked.
What is weird is that if I replicate the column on my own, for example:
myVector <- c("760–563", "01455–1", "4672–04", "11–31234", "22–12", "11111–53", "111–21", "17–356239", "14–22352", "531–353")
and then apply the strsplit, it works.
I already checked my column type and class with
class(df[,1]) and typeof(df[,1]), and both return character, so that's fine.
I was also using the dataframe with dplyr, so it was of type tbl_df. I converted it back to a data frame, but that didn't work either.
I also tried apply(df, 2, function(x) strsplit(x, "-", fixed = T)), but that didn't work either.
Any clues?
I don't know how you did it, but you have two different types of dashes:
charToRaw(substr("760–563", 4, 4))
#[1] 96
charToRaw("-")
#[1] 2d
So the strsplit() is working just fine, it's just that the dash isn't there in your original data. Adjust this, and away you go:
strsplit("760–563", "–")
#[[1]]
#[1] "760" "563"
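The bytes printed by charToRaw depend on the session encoding: 96 is the en dash in Windows-1252, while a UTF-8 session would show e2 80 93 for the same character. A sketch of an encoding-independent check is to look at the Unicode code point instead, then normalise the dash before splitting. The variable names here are illustrative, and the en dash is written as the escape \u2013 so the example is unambiguous.

```r
x <- "760\u2013563"            # en dash (U+2013) between the numbers

utf8ToInt(substr(x, 4, 4))     # 8211, the en dash
utf8ToInt("-")                 # 45, the ordinary hyphen-minus

# Replace the en dash with a plain hyphen, then split as originally intended
x_norm <- gsub("\u2013", "-", x, fixed = TRUE)
strsplit(x_norm, "-", fixed = TRUE)[[1]]
# [1] "760" "563"
```

Normalising once up front means the rest of the code can keep splitting on "-".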
You can just split on a non-numeric character
library(dplyr)
library(tidyr)
data %>%
separate(your_column,
c("first_number", "second_number"),
sep = "[^0-9]")
I have a data.frame (PC) that looks like this:
http://i.stack.imgur.com/NWJKe.png
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
http://i.stack.imgur.com/vQ48u.png
I want to sort the columns (beginning with "GTEX.") in the data.frame such that they are ordered by the age indicated in the age matrix.
PC <- read.csv("protein_coding.csv")
age <- read.table("Annotations_SubjectPhenotypes_DS.txt")
I started by changing the names in the age matrix to replace the '-' by '.':
new_SUBJID <- gsub("-", ".", age$SUBJID, fixed = TRUE)
age[, "SUBJID"] <- new_SUBJID
Then, I ordered the row names (SUBJID) of the age matrix by age:
sort.age <- with(age, age[order(AGE) , ])
sort.age <- na.omit(sort.age)
I then created a vector age.id containing the SUBJIDs in the right order (= how I want to order the columns from the PC matrix).
age.id <- sort.age$SUBJID
But then I am blocked since the names on the PC matrix and the age matrix are not the same... Could someone please help me?
Thank you very much in advance!
Svalf
It would have been better to show the example without using an image. Suppose there are two vectors of strings,
str1 <- c('GTEX.N7MS.0007.SM.2D7W1', 'GTEX.PFPP.0007.SM.2D8W1', 'GTEX.N7MS.0008.SM.4E3J1')
str2 <- c('GTEX.N7MS', 'GTEX.PFPP')
representing the column names of 'PC' and the 'SUBJID' column of 'age' dataset (after replacing the - with . and sorted), we remove the suffix part by matching the . followed by 4 digits (\\d{4}) followed by one or more characters to the end of the string (.*$) and replace it by ''.
str1N <- sub('\\.\\d{4}.*$', '', str1)
str1[order(match(str1N, str2))]
#[1] "GTEX.N7MS.0007.SM.2D7W1" "GTEX.N7MS.0008.SM.4E3J1"
#[3] "GTEX.PFPP.0007.SM.2D8W1"
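Applied to the data frame itself, the same idea might look like the sketch below. PC_demo and its Gene column are invented stand-ins for the real PC (whose layout is only visible in the image), and age.id is assumed to already hold the subject IDs sorted by age.

```r
# Toy stand-in for PC (assumed layout: one annotation column plus GTEX columns)
PC_demo <- data.frame(Gene = c("g1", "g2"),
                      GTEX.PFPP.0007.SM.2D8W1 = c(1, 2),
                      GTEX.N7MS.0007.SM.2D7W1 = c(3, 4),
                      GTEX.N7MS.0008.SM.4E3J1 = c(5, 6))
age.id <- c("GTEX.N7MS", "GTEX.PFPP")   # subject IDs, youngest first (assumed)

is_gtex   <- grepl("^GTEX\\.", names(PC_demo))
gtex_cols <- names(PC_demo)[is_gtex]

# Strip the sample suffix so each column name reduces to its subject ID
subj <- sub("\\.\\d{4}.*$", "", gtex_cols)

# Non-GTEX columns stay in front; GTEX columns follow in age order
ordered_cols <- gtex_cols[order(match(subj, age.id))]
PC_sorted <- PC_demo[c(names(PC_demo)[!is_gtex], ordered_cols)]
names(PC_sorted)
```

Columns whose subject ID is missing from age.id get NA from match and are placed last by order.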
I know there are many similar questions around, but I'm afraid I couldn't get my head around this particular one, though obviously it is very simple!
I am trying to write a simple ifelse function to be applied over a series of columns in a data frame by using column names (rather than numbers). What I try to do is to create a single u_all variable as shown below without typing column names repeatedly.
dat <- data.frame(id=c(1:20),u1 = sample(c(0:1),20,replace=T) , u2 = sample(c(0:1),20,replace=T) , u3 = sample(c(0:1),20,replace=T))
dat<-within(dat,u_all<-ifelse (u1==1 | u2==1 |u3==1,1,0))
dat
I tried many variants of apply but clearly I'm not on the right track as those grouping functions replicate the ifelse function on each column separately.
dat2 <- data.frame(id=c(1:20),u1 = sample(c(0:1),20,replace=T) , u2 = sample(c(0:1),20,replace=T) , u3 = sample(c(0:1),20,replace=T))
dat2<-cbind(dat2,sapply(dat2[,grepl("^u\\d{1,}",colnames(dat2))],
function(x){ u_all<-ifelse(x==1 & !is.na(x),1,0)}))
dat2
This line from the OP
dat<-within(dat,u_all<-ifelse (u1==1 | u2==1 |u3==1,1,0))
can instead be written as
dat$u_all <- +Reduce("|", dat[, c("u1", "u2", "u3")])
How it works, in terms of intermediate objects:
D = dat[, c("u1", "u2", "u3")] uses the names of the columns to subset the data frame.
r = Reduce("|", D) collapses the data by putting | between each pair of columns. The result is a logical (TRUE/FALSE) vector.
To convert r to a 0/1 integer vector, you could use ifelse(r,1L,0L) or as.integer(r) (since TRUE/FALSE converts to 1/0 by default) or just the unary +, like +r.
If you want to avoid using column names (it's really not clear to me from the post), you can construct D = dat[-1] to exclude the first column instead.
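The intermediate objects described above can be inspected on a small fixed example. This is a sketch: the values in dat_demo are made up so the result is reproducible, and selecting the columns by pattern with grep is one possible way to avoid typing the names.

```r
dat_demo <- data.frame(id = 1:4,
                       u1 = c(0, 1, 0, 0),
                       u2 = c(0, 0, 0, 1),
                       u3 = c(0, 1, 0, 0))

# Pick the u-columns by name pattern instead of listing them by hand
u_cols <- grep("^u\\d+$", names(dat_demo), value = TRUE)

D <- dat_demo[, u_cols]   # just the indicator columns
r <- Reduce("|", D)       # (u1 | u2) | u3, a logical vector
r
# [1] FALSE  TRUE FALSE  TRUE

dat_demo$u_all <- +r      # unary plus turns TRUE/FALSE into 1/0
dat_demo$u_all
# [1] 0 1 0 1
```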
You were almost there. Here's a solution using apply over rows, with any and all collapsing a vector of tests into a single digit.
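dat2$u_all <- apply(dat2[,-1], MARGIN=1, FUN=function(x){
# the parentheses matter: * binds tighter than &, so without them the result is logical, not 0/1
(any(x==1) & all(!is.na(x))) * 1
}
)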