I have used cut() on a column to bin the data. I have about 130 rows of the form (93.7,94.1], all in one column. I would like to split the two values into their own columns but am struggling to do this. Here is my code so far:
binned=cut(df$value, 136)
binned_df = data.frame(levels(binned))
Here are the first 10 rows of binned_df:
(42.9,43.4]
(43.4,43.8]
(43.8,44.2]
(44.2,44.6]
(44.6,44.9]
(44.9,45.3]
(45.3,45.7]
(45.7,46.1]
(46.1,46.5]
(46.5,46.9]
Are there any functions to do this? Any help would be much appreciated as I am quite new to R.
We can use read.csv to split the column into two after removing the ( and ] with gsub. read.csv then splits each value on its default separator, the comma.
df1 <- read.csv(text = gsub("\\(|\\]", "", binned_df[[1]]), header = FALSE)
data
binned_df <- structure(list(col = c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]",
"(44.2,44.6]", "(44.6,44.9]", "(44.9,45.3]", "(45.3,45.7]", "(45.7,46.1]",
"(46.1,46.5]", "(46.5,46.9]")), class = "data.frame", row.names = c(NA,
-10L))
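As a minimal sketch of the read.csv approach with named columns: the names lower and upper and the object name bounds are my own choices for illustration, not something from the question. The col.names argument of read.csv sets them directly.

```r
# Example data: a few of the interval labels produced by cut()
binned_df <- data.frame(col = c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]"),
                        stringsAsFactors = FALSE)

# Strip "(" and "]", then let read.csv split on the comma.
# "lower" and "upper" are hypothetical column names.
bounds <- read.csv(text = gsub("\\(|\\]", "", binned_df$col),
                   header = FALSE,
                   col.names = c("lower", "upper"))

bounds
```

Both columns come back as numeric, so they can be used in arithmetic right away, for example bounds$upper - bounds$lower for the bin widths.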
x <- c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]")
limits <- do.call(rbind, #combine result in matrix
strsplit( #split by ,
substring(x, 2, nchar(x) - 1), #remove first and last char
",", fixed = TRUE))
mode(limits) <- "numeric" #change to numeric
limits
# [,1] [,2]
#[1,] 42.9 43.4
#[2,] 43.4 43.8
#[3,] 43.8 44.2
I should say up front that I am a beginner.
I have a single-column (character) data frame for which I would like to find the minimum, maximum
and average price. The min() and max() functions also work with a character vector, but the mean()
and median() functions need a numeric vector. I have tried replacing the comma with a period,
but the problem becomes more complex when prices are in the thousands. How can I do this?
>price
Price
1 1.651
2 2.229,00
3 1.899,00
4 2.160,50
5 1.709,00
6 1.723,86
7 1.770,99
8 1.774,90
9 1.949,00
10 1.764,12
This is the data frame. Thanks in advance to anyone who can help.
Replace , with ., . with empty string and turn the values to numeric.
In base R using gsub -
df <- transform(df, Price = as.numeric(gsub(',', '.',
gsub('.', '', Price, fixed = TRUE), fixed = TRUE)))
# Price
#1 1651.00
#2 2229.00
#3 1899.00
#4 2160.50
#5 1709.00
#6 1723.86
#7 1770.99
#8 1774.90
#9 1949.00
#10 1764.12
You can also use the parse_number function from readr.
library(readr)
df$Price <- parse_number(df$Price,
locale = locale(grouping_mark = ".", decimal_mark = ','))
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
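Once the column is numeric, the summary statistics asked for in the question follow directly. This is a small sketch using the base R conversion on the same example data; the variable names price_min, price_max and price_mean are just illustrative.

```r
df <- data.frame(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
                           "1.709,00", "1.723,86", "1.770,99", "1.774,90",
                           "1.949,00", "1.764,12"),
                 stringsAsFactors = FALSE)

# Drop the thousands separator ".", then turn the decimal "," into "."
df$Price <- as.numeric(gsub(",", ".",
                            gsub(".", "", df$Price, fixed = TRUE),
                            fixed = TRUE))

price_min  <- min(df$Price)   # smallest price
price_max  <- max(df$Price)   # largest price
price_mean <- mean(df$Price)  # average price
```

Note that min() and max() on the original character column compare strings alphabetically, so it is safer to compute them after the conversion as well.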
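library(rvest) # provides read_html(), html_nodes(), html_text() and %>%
url <- "https://www.shoppydoo.it/prezzi-notebook-mwp72t$2fa.html?src=user_search"
page <- read_html(url)
price <- page %>% html_nodes(".price") %>% html_text() %>% data.frame()
colnames(price) <- "Price"
price$Price <- gsub("da ", "", price$Price)
price$Price <- gsub("€", "", price$Price)
# fixed = TRUE treats "." literally; as a regex, "." would match every character
price$Price <- gsub(".", "", price$Price, fixed = TRUE)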
We could use chartr in base R
df$Price <- as.numeric(gsub(",", "", chartr(".,", ",.", df$Price), fixed = TRUE))
data
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
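To make the chartr step easier to follow, here is a small sketch on a single value. chartr translates character by character (it is not a regex), so "." becomes "," and "," becomes "." in one pass; the variable names swapped and value are only for illustration.

```r
# Swap the two separators: "." <-> ","
swapped <- chartr(".,", ",.", "2.229,00")
swapped
# [1] "2,229.00"

# Remove the (now comma) thousands separator and convert
value <- as.numeric(gsub(",", "", swapped, fixed = TRUE))
value
# [1] 2229
```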
I have a dataset with multiple character columns with numbers and >,< signs.
I want to change them all to numeric.
Values of the form "<x" should be halved, and values of the form ">x" should be set to x.
Sample dataframe and my approach (data=labor_df):
data a b c
1 "1" "9" "20"
2 "<10" "14" "1.99"
3 "12" ">5" "14.5"
half.value.a <- (as.numeric(str_extract(labor_df$"a"[which((grepl(labor_df$"a",
pattern = c("<"),fixed = T)))],
"\\d+\\.*\\d*")))/2
min.value.a <- as.numeric(str_extract(labor_df$"a"[which((grepl(labor_df$"a",
pattern = c(">"),fixed = T)))], "\\d+\\.*\\d*"))
labor_df$"a"[which((grepl(labor_df$"a",
pattern = c("<"),fixed = T)))
] <- half.value.a
labor_df$"a"[which((grepl(labor_df$"a",
pattern = c(">"),fixed = T)))
] <- min.value.a
labor_df$"a" <- as.numeric(labor_df$"a")
I would like to apply this to multiple columns in my df or use a different approach entirely to convert multiple columns in my df to numeric.
You can apply this approach to whichever columns you want. In this case, if you want to apply to columns 1 through 3, you can specify as labor_df[1:3]. If you want to apply to specific columns based on the column name, then create a cols vector containing the names of columns to apply this to and use labor_df[cols] instead.
The first gsub will remove the greater than sign, and keep the value unchanged. The ifelse is vectorized and will apply to all values in the column. It will first check with grepl if less than sign is present; if it is, remove it, convert to a numeric value, and then divide by 2. Otherwise leave as is.
labor_df[1:3] <- lapply(labor_df[1:3], function(x) {
x <- gsub(">", "", x)
x <- ifelse(grepl("<", x), as.numeric(gsub("<", "", x)) / 2, x)
as.numeric(x)
})
labor_df
Output
a b c
1 1 9 20.00
2 5 14 1.99
3 12 5 14.50
Data
labor_df <- structure(list(a = c("1", "<10", "12"), b = c("9", "14", ">5"
), c = c("20", "1.99", "14.5")), class = "data.frame", row.names = c(NA,
-3L))
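For selecting columns by name, as described above, a sketch could look like the following. The helper name convert_censored and the choice of columns in cols are assumptions for illustration only.

```r
labor_df <- data.frame(a = c("1", "<10", "12"),
                       b = c("9", "14", ">5"),
                       c = c("20", "1.99", "14.5"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: strip "<" / ">" and halve the values that had "<"
convert_censored <- function(x) {
  num <- as.numeric(gsub("<|>", "", x))
  ifelse(grepl("<", x, fixed = TRUE), num / 2, num)
}

cols <- c("a", "b")  # only these columns are converted
labor_df[cols] <- lapply(labor_df[cols], convert_censored)
labor_df
```

Column c is left as character here because it is not listed in cols; adding "c" to the vector would convert it too.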
There is a similar question about combining vectors with different lengths here, but all answers (except @Ronak Shah's answer) lose the names/colnames.
My problem is that I need to keep the column names, which seems to be possible using the rowr package and cbind.fill.
I would like to stay in base R or use stringi, and the output should remain a matrix.
Test data:
inp <- list(structure(c("1", "2"), .Dim = 2:1, .Dimnames = list(NULL,"D1")),
structure(c("3", "4", "5"), .Dim = c(3L, 1L), .Dimnames = list(NULL, "D2")))
I know that I could get the column names beforehand and then reassign them after creating the matrix, like:
## Using stringi
colnam <- unlist(lapply(inp, colnames))
out <- stri_list2matrix(inp)
colnames(out) <- colnam
out
## Using base-R
colnam <- unlist(lapply(inp, colnames))
max_length <- max(lengths(inp))
nm_filled <- lapply(inp, function(x) {
ans <- rep(NA, length = max_length)
ans[1:length(x)]<- x
ans
})
out <- do.call(cbind, nm_filled)
colnames(out) <- colnam
out
Are there other options that keep the column names?
Since stringi is ok for you to use, you can use the function stri_list2matrix(), i.e.
setNames(as.data.frame(stringi::stri_list2matrix(inp)), sapply(inp, colnames))
# D1 D2
#1 1 3
#2 2 4
#3 <NA> 5
Here is a slightly more concise base R variation
len <- max(lengths(inp))
nms <- sapply(inp, colnames)
do.call(cbind, setNames(lapply(inp, function(x)
replace(rep(NA, len), 1:length(x), x)), nms))
# D1 D2
#[1,] "1" "3"
#[2,] "2" "4"
#[3,] NA "5"
Not sure if this constitutes a sufficiently different solution from what you've already posted. Will remove if deemed too similar.
Update
Or how about a merge?
Reduce(
function(x, y) merge(x, y, all = T, by = 0),
lapply(inp, as.data.frame))[, -1]
# D1 D2
#1 1 3
#2 2 4
#3 <NA> 5
The idea here is to convert the list entries to data.frames, then merge them by row name by setting by = 0 (thanks @Henrik). Note that this will return a data.frame rather than a matrix.
Here is using base:
do.call(cbind,
lapply(inp, function(i){
x <- data.frame(i, stringsAsFactors = FALSE)
as.matrix( x[ seq(max(lengths(inp))), , drop = FALSE ] )
#if the matrices have more than 1 column, use:
#as.matrix( x[ seq(max(sapply(inp, nrow))), , drop = FALSE ] )
}
))
# D1 D2
# 1 "1" "3"
# 2 "2" "4"
# NA NA "5"
The idea is to make all matrices have the same number of rows. When we subset a data frame by row index, rows that do not exist are returned as NA; we then convert back to a matrix and cbind.
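The padding behaviour can be seen in isolation with a tiny sketch. The data frame x and the target length of 3 are made up for illustration.

```r
# A one-column data frame with two rows
x <- data.frame(D1 = c("1", "2"), stringsAsFactors = FALSE)

# Asking for rows 1..3 pads the missing third row with NA
padded <- x[seq_len(3), , drop = FALSE]

nrow(padded)  # now 3 rows
padded$D1     # third element is NA
```

drop = FALSE keeps the result a data frame (and so keeps the column name) even though there is only one column.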
My goal is to turn a matrix of coordinate pairs into a single character string with the coordinate pairs pasted together. For example, I have the coordinates of a line string:
mat <- routes@lines[[9]]@Lines[[1]]@coords
mat
[,1] [,2]
[1,] -122.4491 37.7698
[2,] -122.4519 37.7694
[3,] -122.4491 37.7698
Which I would like to convert into this:
"-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
Where lat and lon of a single pair are separated by a comma, and pairs are separated by a semicolon.
apply(format(mat), 1, paste, sep=";", collapse = "")
does not produce the desired output. How would one do this in R?
Here is the sample data:
dput(mat)
structure(c(-122.4491, -122.4519, -122.4491, 37.7698, 37.7694,
37.7698), .Dim = c(3L, 2L))
mat <- structure(c(-122.4491, -122.4519, -122.4491, 37.7698, 37.7694, 37.7698), .Dim = c(3L, 2L))
This is a matrix, with coordinates in two columns. You can just use one call to paste.
paste(mat[,1], mat[,2], sep = ",", collapse = ";")
# [1] "-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
Here sep sets the delimiter between lat and long coordinates (cells in the same row), and collapse sets the delimiter between coordinate pairs (the delimiter between different rows).
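A short sketch of the difference between the two arguments, using two made-up coordinate vectors (lon and lat are illustrative names):

```r
lon <- c(-122.4491, -122.4519)
lat <- c(37.7698, 37.7694)

# sep only: one string per pair, still a vector of length 2
pairs <- paste(lon, lat, sep = ",")

# sep + collapse: the pairs are then glued into a single string
joined <- paste(lon, lat, sep = ",", collapse = ";")

length(pairs)   # 2
length(joined)  # 1
```

Without collapse, paste stays vectorised and returns one element per row, which is why both arguments are needed to get a single string.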
mat <- structure(c(-122.4491, -122.4519, -122.4491, 37.7698, 37.7694, 37.7698), .Dim = c(3L, 2L))
You were close. Your use of apply is appropriate, but because you are operating row-wise first, you need to handle one delimiter at a time, starting with the one inside each row:
apply(mat, 1, paste, collapse=",")
# [1] "-122.4491,37.7698" "-122.4519,37.7694" "-122.4491,37.7698"
... and then combine all of those with a single external paste:
paste(apply(mat, 1, paste, collapse=","), collapse=";")
# [1] "-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
Another option is to convert to data.frame and then use do.call
do.call(paste, c(as.data.frame(mat), collapse=";", sep=","))
#[1] "-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
I have a column in my dataframe that is made up of strings of numbers, separated by commas. I would like to convert the string to a list of numbers, and then get the mean. My dataframe, df:
a3
1,5,2
103.1
34,6
First, I converted the string to a list:
> df$a3_list <- strsplit(as.character(df$a3), split = ',')
New df:
a3 a3_list
1,5,2 c("1", "5", "2")
103.1 103.1
34,6 c("34", "6")
At this point, however, I'm not sure how to get a new column containing the mean of each cell in df$a3_list
You can use stringi; it's fast.
library(stringi)
mat <- stri_split_fixed(df$a3, ',', simplify=T)
mat <- `dim<-`(as.numeric(mat), dim(mat)) # convert to numeric and save dims
rowMeans(mat, na.rm=T)
# [1] 2.666667 103.100000 20.000000
or with Base R
sapply(strsplit(as.character(df$a3), ",", fixed=T), function(x) mean(as.numeric(x)))
Another base R option
rowMeans(read.table(text=df$a3, sep=",", fill=TRUE), na.rm=TRUE)
#[1] 2.666667 103.100000 20.000000
NOTE: Assuming that the 'a3' is character class. Otherwise, wrap with as.character(df$a3)
data
df <- structure(list(a3 = c("1,5,2", "103.1", "34,6")), .Names = "a3",
class = "data.frame", row.names = c(NA, -3L))
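Since the question asks for a new column, here is a sketch that stores the result back in the data frame. The column name a3_mean is an assumption; vapply is used instead of sapply only so that the result is guaranteed to be a numeric vector.

```r
df <- data.frame(a3 = c("1,5,2", "103.1", "34,6"), stringsAsFactors = FALSE)

# Split each cell on ",", convert the pieces to numeric, take the mean
df$a3_mean <- vapply(strsplit(df$a3, ",", fixed = TRUE),
                     function(x) mean(as.numeric(x)),
                     numeric(1))

df
```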