I have a column in my dataframe that is made up of strings of numbers, separated by commas. I would like to convert the string to a list of numbers, and then get the mean. My dataframe, df:
a3
1,5,2
103.1
34,6
First, I converted the string to a list:
> df$a3_list <- strsplit(as.character(df$a3), split = ',')
New df:
a3 a3_list
1,5,2 c("1", "5", "2")
103.1 103.1
34,6 c("34", "6")
At this point, however, I'm not sure how to get a new column containing the mean of each cell in df$a3_list.
You can use stringi, which is fast:
library(stringi)
mat <- stri_split_fixed(df$a3, ',', simplify = TRUE)
mat <- `dim<-`(as.numeric(mat), dim(mat)) # convert to numeric and save dims
rowMeans(mat, na.rm = TRUE)
# [1] 2.666667 103.100000 20.000000
Or with base R:
sapply(strsplit(as.character(df$a3), ",", fixed = TRUE), function(x) mean(as.numeric(x)))
Another base R option
rowMeans(read.table(text=df$a3, sep=",", fill=TRUE), na.rm=TRUE)
#[1] 2.666667 103.100000 20.000000
NOTE: This assumes that 'a3' is of class character. Otherwise, wrap it with as.character(df$a3).
data
df <- structure(list(a3 = c("1,5,2", "103.1", "34,6")), .Names = "a3",
class = "data.frame", row.names = c(NA, -3L))
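To get the result as a new column, as the question asks, the base R approach can be assigned straight back to the data frame. This is a minimal sketch using the sample data above; the column name a3_mean is made up for illustration.

```r
df <- data.frame(a3 = c("1,5,2", "103.1", "34,6"), stringsAsFactors = FALSE)

# Split each string on commas, convert the pieces to numbers, and average them
parts <- strsplit(as.character(df$a3), ",", fixed = TRUE)
df$a3_mean <- sapply(parts, function(x) mean(as.numeric(x)))  # a3_mean is an illustrative name

df
```

Going straight from the strings to the means avoids keeping the intermediate list column around.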
I am new to R.
I have a single-column (character) data frame for which I would like to find the minimum, maximum,
and average price. The min() and max() functions also work on a character vector, but mean()
and median() need a numeric vector. I have tried replacing the comma with a period,
but the problem becomes more complex when the prices are in the thousands. How can I do this?
>price
Price
1 1.651
2 2.229,00
3 1.899,00
4 2.160,50
5 1.709,00
6 1.723,86
7 1.770,99
8 1.774,90
9 1.949,00
10 1.764,12
This is the data frame. Thanks in advance to anyone who can help.
Remove the . (thousands separator), replace , with . (decimal mark), and convert the values to numeric.
In base R, using gsub:
df <- transform(df, Price = as.numeric(gsub(',', '.',
gsub('.', '', Price, fixed = TRUE), fixed = TRUE)))
# Price
#1 1651.00
#2 2229.00
#3 1899.00
#4 2160.50
#5 1709.00
#6 1723.86
#7 1770.99
#8 1774.90
#9 1949.00
#10 1764.12
You can also use the parse_number function from readr.
library(readr)
df$Price <- parse_number(df$Price,
locale = locale(grouping_mark = ".", decimal_mark = ','))
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
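library(rvest) # provides read_html(), html_nodes(), html_text() and %>%
url <- "https://www.shoppydoo.it/prezzi-notebook-mwp72t$2fa.html?src=user_search"
page <- read_html(url)
price <- page %>% html_nodes(".price") %>% html_text() %>% data.frame()
colnames(price) <- "Price"
price$Price <- gsub("da ", "", price$Price)
price$Price <- gsub("€", "", price$Price)
# fixed = TRUE treats "." as a literal dot; as a regex it would match every character
price$Price <- gsub(".", "", price$Price, fixed = TRUE)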
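Once the strings are numeric, the minimum, maximum and mean asked for in the question follow directly. The sketch below reuses the gsub approach on a plain character vector; it assumes, as the answer's output does, that "1.651" means one thousand six hundred fifty-one. The names price_num and stats are illustrative only.

```r
price <- c("1.651", "2.229,00", "1.899,00", "2.160,50", "1.709,00",
           "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12")

# Drop the thousands separator first, then turn the decimal comma into a dot
price_num <- as.numeric(gsub(",", ".", gsub(".", "", price, fixed = TRUE), fixed = TRUE))

# Summary statistics on the numeric vector
stats <- c(min = min(price_num), max = max(price_num), mean = mean(price_num))
stats
```

The order of the two substitutions matters: replacing the comma first would leave two dots in values such as "2.229,00".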
We could use chartr in base R to swap the . and , characters, then drop the commas that now mark the thousands:
df$Price <- as.numeric(gsub(",", "", chartr(".,", ",.", df$Price), fixed = TRUE))
I have a dataset with multiple character columns containing numbers and >, < signs.
I want to convert them all to numeric.
Values of the form "<x" should be halved, and values of the form ">x" should be set to x.
Sample data frame and my approach (data = labor_df):
data a b c
1 "1" "9" "20"
2 "<10" "14" "1.99"
3 "12" ">5" "14.5"
half.value.a <- (as.numeric(str_extract(labor_df$"a"[which((grepl(labor_df$"a",
pattern = c("<"),fixed = T)))],
"\\d+\\.*\\d*")))/2
min.value.a <- as.numeric(str_extract(labor_df$"a"[which((grepl(labor_df$"a",
pattern = c(">"),fixed = T)))], "\\d+\\.*\\d*"))
labor_df$"a"[which((grepl(labor_df$"a",
pattern = c("<"),fixed = T)))
] <- half.value.a
labor_df$"a"[which((grepl(labor_df$"a",
pattern = c(">"),fixed = T)))
] <- min.value.a
labor_df$"a" <- as.numeric(labor_df$"a")
I would like to apply this to multiple columns in my df or use a different approach entirely to convert multiple columns in my df to numeric.
You can apply this approach to whichever columns you want. In this case, if you want to apply to columns 1 through 3, you can specify as labor_df[1:3]. If you want to apply to specific columns based on the column name, then create a cols vector containing the names of columns to apply this to and use labor_df[cols] instead.
The first gsub removes the greater-than sign and keeps the value unchanged. The ifelse is vectorized and applies to all values in the column. It first checks with grepl whether a less-than sign is present; if it is, the sign is removed and the value is converted to numeric and divided by 2. Otherwise the value is left as is, and the final as.numeric converts the whole column.
labor_df[1:3] <- lapply(labor_df[1:3], function(x) {
x <- gsub(">", "", x)
x <- ifelse(grepl("<", x), as.numeric(gsub("<", "", x)) / 2, x)
as.numeric(x)
})
labor_df
Output
a b c
1 1 9 20.00
2 5 14 1.99
3 12 5 14.50
Data
labor_df <- structure(list(a = c("1", "<10", "12"), b = c("9", "14", ">5"
), c = c("20", "1.99", "14.5")), class = "data.frame", row.names = c(NA,
-3L))
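The same idea can be wrapped in a small helper and applied to columns selected by name, as mentioned above. This is only a sketch: censored_to_numeric is a hypothetical name, and it assumes every value is either a plain number or a number prefixed with < or >.

```r
labor_df <- data.frame(a = c("1", "<10", "12"),
                       b = c("9", "14", ">5"),
                       c = c("20", "1.99", "14.5"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: "<x" becomes x / 2, ">x" becomes x, plain numbers pass through
censored_to_numeric <- function(x) {
  num <- as.numeric(gsub("[<>]", "", x))   # strip the signs, then convert
  ifelse(grepl("<", x, fixed = TRUE), num / 2, num)
}

cols <- c("a", "b", "c")                   # columns to convert, chosen by name
labor_df[cols] <- lapply(labor_df[cols], censored_to_numeric)
str(labor_df)
```

Converting to numeric before the ifelse keeps both branches numeric, so no second conversion is needed at the end.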
Each row has an AD field listed in X1, but its position in the sequence differs from row to row. How can I pick the 'AD' value from X2?
X1 X2
GT:GQ:GQX:DPI:AD:DP 0/1:909:12:125:93,26:119
GT:GQ:GQX:DPI:AD 0/1:909:12:125:35,24
GT:GQ:GQX:DP:DPF:AD 0/1:57:3:11:130:8,3
GT:AD:DP:GQ:PL 0/1:211,31:242:99:138,0,7251
Output
AD
93,26
35,24
8,3
211,31
Split both columns at ":" using strsplit, find the position of "AD" in X1 with grep, and use mapply to select the element at that position from X2.
mapply(`[`, strsplit(d$X2, ":"), sapply(strsplit(d$X1,":"), grep, pattern="AD"))
# [1] "93,26" "35,24" "8,3" "211,31"
Data:
d <- structure(list(X1 = c("GT:GQ:GQX:DPI:AD:DP", "GT:GQ:GQX:DPI:AD",
"GT:GQ:GQX:DP:DPF:AD", "GT:AD:DP:GQ:PL"), X2 = c("0/1:909:12:125:93,26:119",
"0/1:909:12:125:35,24", "0/1:57:3:11:130:8,3", "0/1:211,31:242:99:138,0,7251"
)), class = "data.frame", row.names = c(NA, -4L))
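Maybe you can try regmatches + regexpr in base R. Note that this takes the first "digits,digits" match in each row, so it relies on AD being the first comma-separated field: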
> unlist(regmatches(df$X2,regexpr("\\d+,\\d+",df$X2)))
[1] "93,26" "35,24" "8,3" "211,31"
Using base R and strsplit to extract the "AD" element:
mapply(
function(x, i) x[i],
strsplit(df$X2, ":"),
lapply(strsplit(df$X1, ":"), function(x) which(x == "AD"))
)
# [1] "93,26" "35,24" "8,3" "211,31"
Reproducible data
df <- data.frame(
X1 = c("GT:GQ:GQX:DPI:AD:DP", "GT:GQ:GQX:DPI:AD", "GT:GQ:GQX:DP:DPF:AD", "GT:AD:DP:GQ:PL"),
X2 = c("0/1:909:12:125:93,26:119", "0/1:909:12:125:35,24", "0/1:57:3:11:130:8,3", "0/1:211,31:242:99:138,0,7251")
)
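The position-lookup idea used in the answers above can be generalized to any field name. The sketch below swaps grep for match, which compares the key exactly and returns NA when the key is missing from a row; get_field is a hypothetical helper, not part of any package.

```r
d <- data.frame(
  X1 = c("GT:GQ:GQX:DPI:AD:DP", "GT:GQ:GQX:DPI:AD", "GT:GQ:GQX:DP:DPF:AD", "GT:AD:DP:GQ:PL"),
  X2 = c("0/1:909:12:125:93,26:119", "0/1:909:12:125:35,24", "0/1:57:3:11:130:8,3", "0/1:211,31:242:99:138,0,7251"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up one field by its exact key; NA if the key is absent
get_field <- function(keys, values, field = "AD") {
  mapply(function(k, v) v[match(field, k)],
         strsplit(keys, ":", fixed = TRUE),
         strsplit(values, ":", fixed = TRUE))
}

d$AD <- get_field(d$X1, d$X2)
d$AD
```

Calling get_field(d$X1, d$X2, field = "DP") would look up the DP field in the same way, giving NA for the row that has none.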
I have used cut() on a column to bin the data. I have about 130 rows of the form (93.7,94.1], all in one column. I would like to split the values into their own columns but am struggling to do this. Here is my code so far:
binned=cut(df$value, 136)
binned_df = data.frame(levels(binned))
Here are the first 10 rows of binned_df:
(42.9,43.4]
(43.4,43.8]
(43.8,44.2]
(44.2,44.6]
(44.6,44.9]
(44.9,45.3]
(45.3,45.7]
(45.7,46.1]
(46.1,46.5]
(46.5,46.9]
Are there any functions to do this? Any help would be much appreciated as I am quite new to R.
We can use read.csv to split the column into two after removing the ( and ] with gsub. read.csv uses , as the separator by default.
df1 <- read.csv(text = gsub("\\(|\\]", "", binned_df[[1]]), header = FALSE)
data
binned_df <- structure(list(col = c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]",
"(44.2,44.6]", "(44.6,44.9]", "(44.9,45.3]", "(45.3,45.7]", "(45.7,46.1]",
"(46.1,46.5]", "(46.5,46.9]")), class = "data.frame", row.names = c(NA,
-10L))
x <- c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]")
limits <- do.call(rbind, #combine result in matrix
strsplit( #split by ,
substring(x, 2, nchar(x) - 1), #remove first and last char
",", fixed = TRUE))
mode(limits) <- "numeric" #change to numeric
limits
# [,1] [,2]
#[1,] 42.9 43.4
#[2,] 43.4 43.8
#[3,] 43.8 44.2
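If named columns are preferred over a matrix, the two limits can also be captured with sub and placed in a data frame. This is a sketch that assumes every label has the exact "(lower,upper]" shape shown above; the names bins, lower and upper are illustrative.

```r
x <- c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]")

# Capture the text before and after the comma, ignoring the surrounding brackets
bins <- data.frame(
  lower = as.numeric(sub("^\\((.*),.*$", "\\1", x)),
  upper = as.numeric(sub("^.*,(.*)\\]$", "\\1", x))
)
bins
```

Labels produced with other cut() options, such as a closed left bracket, would need an adjusted pattern.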
I have two data frames, each with a column name:
df1:
name
#one2
!iftwo
there_2_go
come&go
df1 = structure(list(name = c("#one2", "!iftwo", "there_2_go", "come&go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
df2:
name
One2
IfTwo#
there-2-go
come.go
df2 = structure(list(name = c("One2", "IfTwo#", "there-2-go", "come.go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
Comparing the two data frames for inequality with %in% is cumbersome because of the special symbols. Removing the special symbols with stringr could be useful, but how exactly can we use stringr functions with %in% and display the mismatches between them?
I have already used mutate() with tolower() to convert everything to lowercase, as follows:
df1<-mutate(df1,name=tolower(df1$name))
df2<-mutate(df2,name=tolower(df2$name))
Current output of comparison:
df2[!(df2 %in% df1),]
[1] "one2" "iftwo#" "there-2-go" "come.go"
Expected output, since the contents are essentially the same apart from the special symbols:
df2[!(df2 %in% df1),]
character(0)
Question: How do we ignore the symbols in the contents of the data frames?
Here it is in a function:
f1 <- function(df1, df2){
i1 <- tolower(gsub('[[:punct:]]', '', df1$name))
i2 <- tolower(gsub('[[:punct:]]', '', df2$name))
d1 <- sapply(i1, function(i) grepl(paste(i2, collapse = '|'), i))
return(!d1)
}
f1(df1, df2)
# one2 iftwo there2go comego
# FALSE FALSE FALSE FALSE
#or use it for indexing,
df2[f1(df1, df2),]
#character(0)
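Since the question asks specifically about %in%, here is a sketch of the same normalization combined with exact matching rather than the grepl pattern search used in f1. The helper name normalize is hypothetical, and base R's gsub and tolower stand in for the stringr functions.

```r
df1 <- data.frame(name = c("#one2", "!iftwo", "there_2_go", "come&go"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("One2", "IfTwo#", "there-2-go", "come.go"), stringsAsFactors = FALSE)

# Hypothetical helper: lowercase and drop punctuation so names compare on content only
normalize <- function(x) tolower(gsub("[[:punct:]]", "", x))

# Names in df2 with no exact normalized counterpart in df1
mismatch <- df2$name[!(normalize(df2$name) %in% normalize(df1$name))]
mismatch
# character(0)
```

Exact matching avoids false positives from partial matches, for example a short name that happens to be a substring of a longer one.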