Separate columns with space - r

df<-separate(df$ALISVERIS_TARIHI, c("key","value")," ", extra=merge)
Error in UseMethod("separate_") :
no applicable method for 'separate_' applied to an object of class "character"
"20190901" how can I separate this into 3 columns like 2019 09 01?

If you want to separate the first column based on the number of characters, you can use extract -
df <- tidyr::extract(df, ALISVERIS_TARIHI, c('year', 'month', 'day'), '(.{4})(..)(..)')
df
# year month day a
#1 2019 09 01 a
#2 2019 09 08 b
The same pattern can be used with strcapture in base R -
data <- strcapture('(.{4})(..)(..)', df$ALISVERIS_TARIHI,
proto = list(year = integer(), month = integer(), day = integer()))
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(ALISVERIS_TARIHI = c('20190901', '20190908'), a = c('a', 'b'))
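Because the value has a fixed width (4 + 2 + 2 characters), the pieces can also be cut out by position. Here is a minimal base R sketch using substr. The names parts and result are illustrative and not from the answer above, and the pieces stay character so the leading zeros are kept:

```r
df <- data.frame(ALISVERIS_TARIHI = c("20190901", "20190908"), a = c("a", "b"))

# Cut each string by character position: 1-4 year, 5-6 month, 7-8 day
parts <- data.frame(
  year  = substr(df$ALISVERIS_TARIHI, 1, 4),
  month = substr(df$ALISVERIS_TARIHI, 5, 6),
  day   = substr(df$ALISVERIS_TARIHI, 7, 8)
)

# Put the remaining column back next to the new ones
result <- cbind(parts, df["a"])
result
```

Wrap the substr calls in as.integer() if numbers are preferred over strings.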

We could use read.table from base R
cbind(read.table(text = as.character(as.Date(df$ALISVERIS_TARIHI,
format = "%Y%m%d")), sep="-", header = FALSE,
col.names = c("year", "month", "day")), df['a'])
year month day a
1 2019 9 1 a
2 2019 9 8 b
data
df <- structure(list(ALISVERIS_TARIHI = c("20190901", "20190908"),
a = c("a", "b")), class = "data.frame", row.names = c(NA,
-2L))
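If the strings are real dates, a shorter route is to parse them once and pull the components out with format. This is only a sketch, and the name out is made up here. A side effect of parsing is that impossible dates such as "20191301" become NA instead of being split silently:

```r
df <- data.frame(ALISVERIS_TARIHI = c("20190901", "20190908"), a = c("a", "b"))

# Parse the yyyymmdd strings as Date
d <- as.Date(df$ALISVERIS_TARIHI, format = "%Y%m%d")

# Extract each component and convert it to integer
out <- data.frame(
  year  = as.integer(format(d, "%Y")),
  month = as.integer(format(d, "%m")),
  day   = as.integer(format(d, "%d")),
  a     = df$a
)
out
```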

Related

R: adding matching vector values from two dataframes in one column

I have a data frame which is configured roughly like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(0,0,0))
words    frequency  count
hello    7          0
yes      8          0
example  5          0
What I'm trying to do is add values to the third column from a different data frame, which is similar but looks like this:
df2 <- cbind(c('example','hello') ,c(5,6))
words    frequency
example  5
hello    6
My goal is to find matching values for the first column in both data frames (they have the same column name) and add matching values from the second data frame to the third column of the first data frame.
The result should look like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(6,0,5))
words    frequency  count
hello    7          6
yes      8          0
example  5          5
What I've tried so far is:
df <- merge(df,df2, by = "words", all.x=TRUE)
However, it doesn't work.
I could use some help understanding how this could be done. Any help will be welcome.
This is an "update join". My favorite way to do it is in dplyr:
library(dplyr)
df %>% rows_update(rename(df2, count = frequency), by = "words")
In base R you could do the same thing like this:
names(df2)[2] = "count2"
df = merge(df, df2, by = "words", all.x=TRUE)
df$count = ifelse(is.na(df$count2), df$count, df$count2)
df$count2 = NULL
Here is an option with data.table:
library(data.table)
setDT(df)[setDT(df2), on = "words", count := i.frequency]
Output
words frequency count
<char> <num> <num>
1: hello 7 6
2: yes 8 0
3: example 5 5
Or using match in base R:
df$count[match(df2$words, df$words)] <- df2$frequency
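Note that the match one-liner assumes every word in df2 also occurs in df. If that is not guaranteed, match returns NA for the missing words and those indices cannot be used for assignment. A hedged sketch that matches in the other direction and skips the misses (idx and hit are made-up names, and the extra word "absent" is added here only to show the case):

```r
df <- data.frame(words = c("hello", "yes", "example"),
                 frequency = c(7, 8, 5), count = c(0, 0, 0))
df2 <- data.frame(words = c("example", "hello", "absent"),
                  frequency = c(5, 6, 9))

# For each word in df, its position in df2 (NA when df2 has no such word)
idx <- match(df$words, df2$words)
hit <- !is.na(idx)

# Update count only where a match exists; other rows keep their old value
df$count[hit] <- df2$frequency[idx[hit]]
df
```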
Or another option with tidyverse using left_join and coalesce:
library(tidyverse)
left_join(df, df2 %>% rename(count.y = frequency), by = "words") %>%
mutate(count = coalesce(count.y, count)) %>%
select(-count.y)
Data
df <- structure(list(words = c("hello", "yes", "example"), frequency = c(7,
8, 5), count = c(0, 0, 0)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(words = c("example", "hello"), frequency = c(5, 6)), class = "data.frame", row.names = c(NA,
-2L))

how to convert a price character vector to a numeric vector

I should say up front that I am a beginner.
I have a single-column (character) data frame for which I would like to find the minimum, maximum
and average price. The min() and max() functions also work with a character vector, but the mean()
and median() functions need a numeric vector. I have tried replacing the comma with a period,
but the problem becomes more complex when the prices are in the thousands. How can I do this?
>price
Price
1 1.651
2 2.229,00
3 1.899,00
4 2.160,50
5 1.709,00
6 1.723,86
7 1.770,99
8 1.774,90
9 1.949,00
10 1.764,12
This is the dataframe. I thank anyone who wants to help me in advance
Remove the . (thousands separator) first, then replace , with . and turn the values to numeric.
In base R using gsub -
df <- transform(df, Price = as.numeric(gsub(',', '.',
gsub('.', '', Price, fixed = TRUE), fixed = TRUE)))
# Price
#1 1651.00
#2 2229.00
#3 1899.00
#4 2160.50
#5 1709.00
#6 1723.86
#7 1770.99
#8 1774.90
#9 1949.00
#10 1764.12
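Once the column is numeric, the statistics asked for in the question work directly. A small sketch on a plain vector with the same values (the names price, num and stats are illustrative only):

```r
price <- c("1.651", "2.229,00", "1.899,00", "2.160,50", "1.709,00",
           "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12")

# Drop the "." thousands separator, then turn the "," decimal mark into "."
num <- as.numeric(gsub(",", ".", gsub(".", "", price, fixed = TRUE), fixed = TRUE))

# Minimum, maximum, mean and median now behave as expected
stats <- c(min = min(num), max = max(num), mean = mean(num), median = median(num))
stats
```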
You can also use the parse_number function from readr.
library(readr)
df$Price <- parse_number(df$Price,
locale = locale(grouping_mark = ".", decimal_mark = ','))
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
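# Scraping code that produced the Price column (needs rvest)
library(rvest)
url <- "https://www.shoppydoo.it/prezzi-notebook-mwp72t$2fa.html?src=user_search"
page <- read_html(url)
price <- page %>% html_nodes(".price") %>% html_text() %>% data.frame()
colnames(price) <- "Price"
price$Price <- gsub("da ", "", price$Price)
price$Price <- gsub("€", "", price$Price)
# fixed = TRUE treats "." literally; as a regex it would match every character
price$Price <- gsub(".", "", price$Price, fixed = TRUE)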
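A point worth spelling out about removing the dot: in gsub the pattern is a regular expression by default, and an unescaped . matches any character, so the whole string is wiped. A short illustration on a made-up value x:

```r
x <- "1.723,86"

# "." as a regex matches every character, so nothing is left
as_regex <- gsub(".", "", x)

# fixed = TRUE makes the pattern a literal dot
as_fixed <- gsub(".", "", x, fixed = TRUE)

# Escaping the dot is the regex equivalent
as_escaped <- gsub("\\.", "", x)

c(as_regex = as_regex, as_fixed = as_fixed, as_escaped = as_escaped)
```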
We could use chartr in base R
# swap . and , with chartr, then drop the , thousands separators
df$Price <- as.numeric(gsub(",", "", chartr(".,", ",.", df$Price), fixed = TRUE))
data
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))

identify observations based on 2 elements in 2 dataframes that do not match [duplicate]

I want to identify observations in 1 df that do not match that of another df using 2 indicators (id and date). Below is sample df1 and df2.
df1
id date n
12-40 12/22/2018 3
11-08 10/02/2016 11
df2
id date interval
12-40 12/22/2018 3
11-08 10/02/2016 32
22-22 11/10/2015 11
I want a df that outputs rows that are in df2, but not in df1, like so. Note that row 3 (based on id and date) of df2 is not in df1.
df3
id date interval
22-22 11/10/2015 11
I tried doing this in tidyverse and was not able to get the code to work. Does anyone have suggestions on how to do this?
We can use anti_join from dplyr (part of the tidyverse, which the OP mentioned), matching on both 'id' and 'date' as in the OP's post. More complex joins can also be done with the tidyverse.
library(dplyr)
anti_join(df2, df1, by = c('id', 'date'))
# id date interval
#1 22-22 11/10/2015 11
Or a similar option with data.table and it should be very efficient
library(data.table)
setDT(df2)[!df1, on = .(id, date)]
# id date interval
#1: 22-22 11/10/2015 11
data
df1 <- structure(list(id = c("12-40", "11-08"), date = c("12/22/2018",
"10/02/2016"), n = c(3L, 11L)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(id = c("12-40", "11-08", "22-22"), date = c("12/22/2018",
"10/02/2016", "11/10/2015"), interval = c(3L, 32L, 11L)), class = "data.frame",
row.names = c(NA,
-3L))
Try this (both options are base R, match on id and date as the OP asked, and do not require any package):
#Code1
df3 <- df2[!paste(df2$id,df2$date) %in% paste(df1$id,df1$date),]
Output:
id date interval
3 22-22 11/10/2015 11
Alternatively, using subset:
#Code 2
df3 <- subset(df2,!paste(id,date) %in% paste(df1$id,df1$date))
Output:
id date interval
3 22-22 11/10/2015 11
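One caveat with the paste approach: with the default separator (a space), two different id/date pairs could in principle paste to the same string if the values themselves contain spaces. A hedged variant with an explicit separator, assuming the string "||" never occurs in either column (the helper name make_key is hypothetical):

```r
df1 <- data.frame(id = c("12-40", "11-08"),
                  date = c("12/22/2018", "10/02/2016"),
                  n = c(3L, 11L))
df2 <- data.frame(id = c("12-40", "11-08", "22-22"),
                  date = c("12/22/2018", "10/02/2016", "11/10/2015"),
                  interval = c(3L, 32L, 11L))

# Build one composite key per row from id and date
make_key <- function(d) paste(d$id, d$date, sep = "||")

# Keep the rows of df2 whose key does not appear in df1
df3 <- df2[!make_key(df2) %in% make_key(df1), ]
df3
```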
Some data used:
#Data1
df1 <- structure(list(id = c("12-40", "11-08"), date = c("12/22/2018",
"10/02/2016"), n = c(3L, 11L)), class = "data.frame", row.names = c(NA,
-2L))
#Data2
df2 <- structure(list(id = c("12-40", "11-08", "22-22"), date = c("12/22/2018",
"10/02/2016", "11/10/2015"), interval = c(3L, 32L, 11L)), class = "data.frame", row.names = c(NA,
-3L))
Another base R option using merge + subset + complete.cases
df3 <- subset(
u <- merge(df1, df2, by = c("id", "date"), all.y = TRUE),
!complete.cases(u)
)[names(df2)]
which gives
> df3
id date interval
3 22-22 11/10/2015 11

How can I filter based on 2 conditions

I am not able to filter based on 2 conditions. as1 is a data frame:
as1
da cat
1 2016-06-04 04:05:45 A
2 2016-06-04 04:05:46 B
3 2016-06-04 04:05:45 C
4 2016-06-04 04:05:46 D
as2 <- as1 %>% filter(as.POSIXct("2016-06-04 04:05:45") && cat == "A")
I need the data frame below:
as2
da cat
1 2016-06-04 04:05:45 A
Let's make some reproducible data as your question is missing it:
as1 <- read.csv(header = TRUE, text = "
da,cat
2016-06-04 04:05:45,A
2016-06-04 04:05:46,B
2016-06-04 04:05:45,C
2016-06-04 04:05:46,D", stringsAsFactors = FALSE)
Now the first thing you want to check is whether the column "da" is, in fact, POSIXct.
class(as1$da)
#> [1] "character"
In my sample it is not, so I add an extra line to the dplyr pipe.
library(dplyr)
as2 <- as1 %>%
mutate(da = as.POSIXct(da)) %>% # add only if column isn't POSIXct
filter(da == as.POSIXct("2016-06-04 04:05:45") & cat == "A")
Basically, what you did wrong was leaving as.POSIXct("2016-06-04 04:05:45") as a bare expression. filter evaluates a condition, meaning it only keeps the rows where that condition is TRUE. Hence, to select "2016-06-04 04:05:45" you need a test: da == as.POSIXct("2016-06-04 04:05:45").
You need & here and not && because & compares element by element (one result per row), while && is meant for a single TRUE/FALSE value.
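To see what the condition actually produces, the same test can be written in base R. This is a sketch rather than the answer's own code: the name keep is made up, and tz = "UTC" is an assumption added so the comparison does not depend on the local time zone:

```r
as1 <- data.frame(
  da = as.POSIXct(c("2016-06-04 04:05:45", "2016-06-04 04:05:46",
                    "2016-06-04 04:05:45", "2016-06-04 04:05:46"), tz = "UTC"),
  cat = c("A", "B", "C", "D")
)

# & works element-wise: one TRUE/FALSE per row
keep <- as1$da == as.POSIXct("2016-06-04 04:05:45", tz = "UTC") & as1$cat == "A"
keep

# Keep only the rows where both tests are TRUE
as2 <- as1[keep, ]
as2
```

The logical vector keep is what filter needs to receive: one value per row.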
You were almost there. This is a possible solution for you. You needed to convert the data to a date-time format using lubridate before filtering.
# load library
library(dplyr)
# create data
x = data.frame(da = c("2019-10-04 07:05:02","2019-10-04 07:05:03","2019-10-04 07:05:02","2019-10-04 07:05:03","2019-10-04 07:05:04"),
db = c("a","a","c","a","a"), stringsAsFactors = F)
# convert to date time format
x$da = lubridate::ymd_hms(x$da)
# see the structure of data
str(x)
# filter the data
x %>% filter(da <= lubridate::ymd_hms('2019-10-04 07:05:02') & db == 'a' )
# da db
#1 2019-10-04 07:05:02 a
Your data
# Data
x = structure(list(da = structure(c(1464993345, 1464993346, 1464993345, 1464993346), class = c("POSIXct", "POSIXt"), tzone = ""), cat = structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor")), class = "data.frame", row.names = c(NA, -4L))
# convert to date time format
x$da = lubridate::ymd_hms(x$da)
# see the structure of data
str(x)
# filter the data
x %>% filter(da <= lubridate::ymd_hms('2016-06-03 15:35:45') & cat == 'A' )
# da cat
#1 2016-06-03 15:35:45 A

How to find the max value in a list and its row and column index in R?

I have a list like this:
> A
[[1]]
Jan Feb Mar
Product_A 14 7 90
Product_B 1 2 3
[[2]]
Jan Feb Mar
Product_C 15 7 9
I want to have the max value in this list and its row and column names. I would like to see something like this: 90, Product_A, Mar
I really appreciate it if somebody could help me with this.
Thanks
To me it is unclear whether you want the names of the max value of the whole list or of every data frame inside the list. There is already an answer from @RonakShah for the latter interpretation, so I am posting one in case you are looking for the names of the single max value of the whole list. Using list.which and the idea from a similar question, you can do
library(rlist)
library(reshape2)
max_val <- max(unlist(list_df))
which_list <- list.which(list_df, max(Jan, Feb, Mar) == max_val)
df <- list_df[[which_list]]
subset(melt(as.matrix(df)), value == max_val)
Var1 Var2 value
Product_A Mar 90
As @r2evans mentioned, we can first work out how to solve it for one matrix/data frame and then use lapply to apply it to the list of data frames. Assuming your list is called list_df, one way would be
get_max_row_col <- function(x) {
val <- max(x, na.rm = TRUE)
arr <- which(x == val, arr.ind = TRUE)
data.frame(value = val, rowname = rownames(x)[arr[, 1]],
column_name = colnames(x)[arr[, 2]])
}
do.call(rbind, lapply(list_df, get_max_row_col))
# value rowname column_name
#1 90 Product_A Mar
#2 15 Product_C Jan
We can also use purrr's map_df or map_dfr with the same function
purrr::map_df(list_df, get_max_row_col)
As @schwantke points out, if you want to find the single maximum value across the entire list, you can skip the lapply part, bind all the data frames together into one data frame and perform the calculation on that.
df1 <- do.call(rbind, list_df)
val <- max(df1)
arr <- which(df1 == val, arr.ind = TRUE)
data.frame(value = val, rowname = rownames(df1)[arr[, 1]],
column_name = colnames(df1)[arr[, 2]])
# value rowname column_name
#1 90 Product_A Mar
data
list_df <- list(structure(list(Jan = c(14L, 1L), Feb = c(7L, 2L), Mar = c(90L,
3L)), class = "data.frame", row.names = c("Product_A", "Product_B"
)), structure(list(Jan = 15L, Feb = 7L, Mar = 9L),
class = "data.frame", row.names = "Product_C"))
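If only the single overall maximum is needed, which.max together with arrayInd is another way to get the position. This is a sketch under the assumption that every list element has row and column names and converts cleanly with as.matrix. The names maxima, best, pos and result are made up for this example, and ties are resolved by taking the first maximum:

```r
list_df <- list(
  data.frame(Jan = c(14L, 1L), Feb = c(7L, 2L), Mar = c(90L, 3L),
             row.names = c("Product_A", "Product_B")),
  data.frame(Jan = 15L, Feb = 7L, Mar = 9L, row.names = "Product_C")
)

# Largest value of each list element
maxima <- vapply(list_df, function(x) as.numeric(max(as.matrix(x))), numeric(1))

# The element holding the overall maximum, as a matrix
best <- as.matrix(list_df[[which.max(maxima)]])

# Row and column position of that maximum (the first one if tied)
pos <- arrayInd(which.max(best), dim(best))

result <- data.frame(value = best[pos],
                     rowname = rownames(best)[pos[1, 1]],
                     column_name = colnames(best)[pos[1, 2]])
result
```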
