I am trying to add a vector that I generated in R to an SQLite table as a new column. For this I wanted to use dplyr (I installed the most recent development version along with the dbplyr package, following this post here). What I tried:
library(dplyr)
library(DBI)
#creating initial database and table
dbcon <- dbConnect(RSQLite::SQLite(), "cars.db")
dbWriteTable(dbcon, name = "cars", value = cars)
cars_tbl <- dplyr::tbl(dbcon, "cars")
#new values which I want to add as a new column
new_values <- sample(c("A","B","C"), nrow(cars), replace = TRUE)
#attempt to add new values as column to the table in the database
cars_tbl %>% mutate(new_col = new_values) #not working
What is an easy way to achieve this (not necessarily with dplyr)?
I'm not aware of a way of doing this with dplyr, but you can do it with RSQLite directly. The problem is not actually with RSQLite; it's that I don't know how to pass a local vector to mutate on a database table. Note that, in your code, something like this would work:
cars_tbl %>% mutate(new_col = another_column / 3.14)
Anyway, here is my alternative. I've created a toy cars data frame:
cars <- data.frame(year=c(1999, 2007, 2009, 2017), model=c("Ford", "Toyota", "Toyota", "BMW"))
I open the connection and create the table,
dbcon <- dbConnect(RSQLite::SQLite(), "cars.db")
dbWriteTable(dbcon, name = "cars", value = cars)
Add the new column and check,
dbGetQuery(dbcon, "ALTER TABLE cars ADD COLUMN new_col TEXT")
dbGetQuery(dbcon, "SELECT * FROM cars")
year model new_col
1 1999 Ford <NA>
2 2007 Toyota <NA>
3 2009 Toyota <NA>
4 2017 BMW <NA>
Then you can update the new column. The only tricky part is that you have to provide a WHERE clause; in this case I use the year.
new_values <- sample(c("A","B","C"), nrow(cars), replace = TRUE)
new_values
[1] "C" "B" "B" "B"
dbGetPreparedQuery(dbcon, "UPDATE cars SET new_col = ? where year=?",
bind.data=data.frame(new_col=new_values,
year=cars$year))
dbGetQuery(dbcon, "SELECT * FROM cars")
year model new_col
1 1999 Ford C
2 2007 Toyota B
3 2009 Toyota B
4 2017 BMW B
As a unique index, you could always use rownames(cars), but you would have to add it as a column in your dataframe and then in your table.
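For example, a minimal sketch of that idea (using dbExecute, introduced just below, and an arbitrary column name row_id; this rewrites the table from scratch):
# Sketch only: store the rownames in an explicit key column before writing the table,
# then use that column in the WHERE clause.
cars$row_id <- rownames(cars)
dbWriteTable(dbcon, name = "cars", value = cars, overwrite = TRUE)
dbExecute(dbcon, "ALTER TABLE cars ADD COLUMN new_col TEXT")
dbExecute(dbcon, "UPDATE cars SET new_col = :new_col WHERE row_id = :row_id",
          params = data.frame(new_col = new_values, row_id = cars$row_id))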
EDIT after a suggestion by @krlmlr: it is indeed much better to use dbExecute instead of the deprecated dbGetPreparedQuery,
dbExecute(dbcon, "UPDATE cars SET new_col = :new_col where year = :year",
params=data.frame(new_col=new_values,
year=cars$year))
EDIT after comments: I did not think about this a few days ago, but even though it is SQLite you can use the rowid. I've tested this and it works.
dbExecute(dbcon, "UPDATE cars SET new_col = :new_col where rowid = :id",
params=data.frame(new_col=new_values,
id=rownames(cars)))
You do have to make sure that the rowids in the table are the same as your rownames. In any case, you can always check the rowids like this:
dbGetQuery(dbcon, "SELECT rowid, * FROM cars")
rowid year model new_col
1 1 1999 Ford C
2 2 2007 Toyota B
3 3 2009 Toyota B
4 4 2017 BMW B
Related
I have a data frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries, and I'd like to select only industries 5 to 22 per country (18 industries). The catch is that each industry-per-country column is named like AUS1, AUS2, ..., AUS56, so what I need to select is AUS5 to AUS22, AUT5 to AUT22, and so on.
A viable solution could be to select columns according to the following algorithm: the first column of interest, AUS5, corresponds to column 10, and I select up to AUS22 (column 27). Then I skip the remaining columns for AUS (AUS23 to AUS56) and the first 4 columns of the next country (AUT1 to AUT4), and again select industries 5 to 22 for AUT. In other words, starting from column 10 the algorithm should select 18 columns (including column 10), skip the next 38 columns, select the next 18 columns, and so on, repeated for all 43 countries.
How can I code that?
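To make the positional idea concrete, here is a rough sketch of what I have in mind (assuming the real data really consists of 43 blocks of 56 columns each, with the first block of interest starting at column 10; wide_df is a placeholder name for the full dataset, and any identifier columns would have to be kept separately):
starts <- seq(from = 10, by = 56, length.out = 43)      # first selected column of each country block
keep <- unlist(lapply(starts, function(s) s:(s + 17)))  # 18 consecutive columns per country
wide_df_sel <- wide_df[, keep]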
UPDATE, Example:
df = data.frame(industry = c("C10","C11","C12","C13"),
                country = c("USA"),
                AUS10 = runif(4),
                AUS11 = runif(4),
                AUS12 = runif(4),
                AUS13 = runif(4),
                DEU10 = runif(4),
                DEU11 = runif(4),
                DEU12 = runif(4),
                DEU13 = runif(4))
#I'm interested only in C10-C11:
df_a=df %>% filter(grepl('C10|C11',industry))
df_a
#Thus, how can I select columns AUS10,AUS11, DEU10,DEU11 efficiently, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
subset(df, industry %in% paste0('C', 10:11),
       select = c('industry', 'country', paste0(rep(ctr, each = length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep, you could do:
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))
df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
..select_ctr] # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.
I'm using RSQLite to import datasets from an SQLite database. There are many millions of observations in the database, so I'd like to do as much of the data selection and aggregation as possible within the database.
At some point I need to aggregate a character variable: I want the value that occurs most often within each group. How can I edit the following dplyr chain so that it also works with RSQLite?
library(tidyverse)
library(RSQLite)
# Path to Database
DATABASE="./xxx.db"
# Connect Database
mydb <- dbConnect(RSQLite::SQLite(), DATABASE)
# Load Database
data = tbl(mydb, "BigData")
# Query Database
Summary <- data %>%
  filter(year == 2020) %>%
  group_by(Grouping_variable) %>%
  summarize(count = n(),
            Item_variable = names(which.max(table(Item_variable))))
On a local data frame that code would do its job; querying the database, I get the error Error: near "(": syntax error.
The original pipe contains more filters and steps.
An example database would basically look like:
data.frame(Grouping_variable = c("A","A","B","C","C","C","D","D","D","D"),
           year = c(2019,2020,2019,2020,2020,2020,2020,2020,2020,2021),
           Item_variable = c("X","Y","Y","X","X","Y","Y","Y","X","X"))
Grouping_variable year Item_Variable
1 A 2019 X
2 A 2020 Y
3 B 2019 Y
4 C 2020 X
5 C 2020 X
6 C 2020 Y
7 D 2020 Y
8 D 2020 Y
9 D 2020 X
10 D 2021 X
Result should look like:
Grouping_variable count Item_variable
<chr> <int> <chr>
1 A 1 Y
2 C 3 X
3 D 3 Y
Assuming that DF is the data frame defined in the question, and using SQL: first calculate the count of each item within each group for the year 2020, giving tmp; then take the row whose count is maximum, giving tmp2 (SQLite guarantees that, when group by is combined with max, the other fields come from the row where the maximum was found). tmp2 also takes the sum of the counts, and the final select keeps just the desired columns.
library(sqldf)
sql <- "with tmp as (
select Grouping_variable, count(*) count, Item_variable from DF
where year = 2020
group by Grouping_variable, Item_variable
),
tmp2 as (
select Grouping_variable, max(count), sum(count) count, Item_variable
from tmp
group by Grouping_variable
)
select Grouping_variable, count, Item_variable
from tmp2
"
sqldf(sql)
giving:
Grouping_variable count Item_variable
1 A 1 Y
2 C 3 X
3 D 3 Y
Added
Suppose that DF were a table in your database. This code creates such a database.
library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
dbWriteTable(con, 'DF', DF, row.names = FALSE)
dbDisconnect(con)
Then this would run the SQL command in the sql string defined above against that database and return the result:
library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
result <- dbGetQuery(con, sql)
dbDisconnect(con)
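If you would rather keep everything in the dplyr chain, here is a rough dbplyr-only sketch of the same idea: count each item per group, then keep the most frequent one. It relies on SQL window functions, so it assumes a reasonably recent SQLite, and ties would come back as multiple rows per group.
Summary <- data %>%
  filter(year == 2020) %>%
  count(Grouping_variable, Item_variable) %>%    # rows per group/item combination
  group_by(Grouping_variable) %>%
  mutate(count = sum(n, na.rm = TRUE)) %>%       # total rows per group (window sum)
  filter(n == max(n, na.rm = TRUE)) %>%          # keep the most frequent item(s)
  select(Grouping_variable, count, Item_variable)
Add collect() at the end if you want the result as a local data frame.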
Background:
I'm working with a fairly large (>10,000 rows) dataset of individual cars, and I need to do some analysis on it. I need to keep this dataset d intact, but I'm only going to be analyzing cars made by Japanese companies (e.g. Nissan, Honda, etc.). d contains information like VIN_prefix (the first two letters of a VIN number that indicates the "World Manufacturer Number"), model year, and make, but no explicit indicator of whether the car is made by a Japanese firm. Here's d:
d <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
stringsAsFactors=FALSE)
Here, rows 3, 4, and 5 correspond to Japanese cars: the NA in row 3 is actually an Acura whose make is missing. See below, when I get to the other dataset, for why this is.
d also lacks some attributes (columns) about cars that I need for my analysis, e.g. the current CEO of Japanese car firms.
Enter another dataset, a, a dataset about Japanese car firms which contains those extra attributes as well as columns that could be used to identify whether a given car (row) in d is made by a Japanese firm. One of those is VIN_prefix; the other is jp_makes, a list of Japanese auto firms. Here's a:
a <- data.frame(
VIN_prefix = c("JH","JF","1N"),
jp_makes = c("Acura","Subaru","Nissan"),
current_ceo = c("Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida"),
stringsAsFactors=FALSE)
Here, we can see that the "Acura" make, missing in the car from row 3 in d, could be identified by its VIN_prefix "JH", which in row 3 of d is not NA.
Goal:
Left join a onto d so that each of the 3 Japanese cars in d gets the relevant corresponding attributes from a - mainly, current_ceo. (Non-Japanese cars in d would have NA for columns joined from a; this is fine.)
Problem:
As you can tell, the two relevant variables in d that could be used as keys in a join - make and VIN_prefix - have missing data in d. The "matching rules" we could use are imperfect: I could match on d$make == a$jp_makes or on d$VIN_prefix == a$VIN_prefix, but they'd each be wrong due to the missing data in d.
What to do?
What I've tried:
I can try left joining on either one of these potential keys, but not all 3 of the Japanese cars in d wind up with their correct information from a:
try1 <- left_join(d, a, by = c("make" = "jp_makes"))
try2 <- left_join(d, a, by = c("VIN_prefix" = "VIN_prefix"))
I can successfully generate a logical 'indicator' variable in d that tells me whether a car is Japanese or not:
entries_make <- a$jp_makes
entries_vin_prefix <- a$VIN_prefix
d <- d %>%
  mutate(is_jp = ifelse(d$VIN_prefix %in% entries_vin_prefix | d$make %in% entries_make, 1, 0) %>%
           as.logical())
But that only gets me halfway: I still need those other columns from a to sit next to those Japanese cars in d. It's unfeasible to manually fill all the missing data in some other way; the real datasets these toy examples correspond to are too big for that and I don't have the manpower or time.
Ideally, I'd like a dataset that looks something like this:
ideal <- data.frame(
make = c("GMC","Dodge","NA","Subaru","Nissan","Chrysler"),
model_yr = c("1999","2004","1989","1999","2006","2012"),
VIN_prefix = c("1G","1D","JH","JF","NA","2C"),
current_ceo = c("NA", "NA", "Toshihiro Mibe","Tomomi Nakamura","Makoto Ushida", "NA"),
stringsAsFactors=FALSE)
What do you all think? I've looked at other posts (e.g. here) but their solutions don't really apply. Any help is much appreciated!
Left join on an OR of the two conditions.
library(sqldf)
sqldf("select d.*, a.current_ceo
from d
left join a on d.VIN_prefix = a.VIN_prefix or d.make = a.jp_makes")
giving:
make model_yr VIN_prefix current_ceo
1 GMC 1999 1G <NA>
2 Dodge 2004 1D <NA>
3 NA 1989 JH Toshihiro Mibe
4 Subaru 1999 JF Tomomi Nakamura
5 Nissan 2006 NA Makoto Ushida
6 Chrysler 2012 2C <NA>
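A dplyr-only sketch of the same idea (join on each key separately and then coalesce the two results; this assumes you first turn the literal "NA" strings in d into real missing values):
library(dplyr)
d_ceo <- d %>%
  mutate(across(c(make, VIN_prefix), ~ na_if(.x, "NA"))) %>%   # "NA" strings -> real NA
  left_join(select(a, VIN_prefix, ceo_by_vin = current_ceo),
            by = "VIN_prefix") %>%
  left_join(select(a, jp_makes, ceo_by_make = current_ceo),
            by = c("make" = "jp_makes")) %>%
  mutate(current_ceo = coalesce(ceo_by_vin, ceo_by_make)) %>%
  select(-ceo_by_vin, -ceo_by_make)
The three Japanese cars pick up current_ceo from whichever key matches; everything else stays NA.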
Use a two-pass method. First fill in the missing make (or VIN) values; I'll illustrate by filling in make values. Do notice that "NA" is not the same as NA: the first is a character value while the latter is a true R missing value, so I'd first convert those strings to true missing values. In natural language, I am replacing the missing make values in d with the jp_makes values taken from a on the basis of matching VIN_prefix values:
is.na(d$make) <- d$make == "NA"   # turn the "NA" strings into real NAs
d$make[is.na(d$make)] <- a$jp_makes[
  match(d$VIN_prefix[is.na(d$make)], a$VIN_prefix)]
Now you have the make values filled in on the basis of the table lookup in a. It should be trivial to do the merge you wanted by using by.x='make', by.y='jp_makes':
merge(d, a, by.x='make', by.y='jp_makes', all.x=TRUE)
make model_yr VIN_prefix.x VIN_prefix.y current_ceo
1 Acura 1989 JH JH Toshihiro Mibe
2 Chrysler 2012 2C <NA> <NA>
3 Dodge 2004 1D <NA> <NA>
4 GMC 1999 1G <NA> <NA>
5 Nissan 2006 NA 1N Makoto Ushida
6 Subaru 1999 JF JF Tomomi Nakamura
You can then use the values in VIN_prefix.y to replace the "NA" strings in VIN_prefix.x.
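For example (a sketch, assuming the merge result above is stored in m):
m <- merge(d, a, by.x = "make", by.y = "jp_makes", all.x = TRUE)
needs_fix <- m$VIN_prefix.x == "NA" & !is.na(m$VIN_prefix.y)   # rows whose prefix is the literal string "NA"
m$VIN_prefix.x[needs_fix] <- m$VIN_prefix.y[needs_fix]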
Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do exactly the same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to get the row names of your data frame into an index column, filter down to the values in sorted, and then order the rows according to sorted:
df2 <- df %>% mutate(index = row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2$index, sorted)), ]
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
Seems to be the shortest implementation of what df[sorted,] does.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know) that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort rows, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = 1:length(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info) %>%
arrange(rank) %>%
select(-rank)
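A more compact variant of the same idea, skipping the helper tibble (match() gives each row's position in sorted_ids, which arrange() can sort on directly):
ordered_ids = df %>%
  filter(custom_id %in% sorted_ids) %>%
  arrange(match(custom_id, sorted_ids))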
I am reading in data from a .txt file that contains thousands of records
table1 <- read.table("teamwork.txt", sep ="|", fill = TRUE)
Looks like:
f_name l_name hours_worked code
Jim Baker 8.5 T
Richard Copton 4.5 M
Tina Bar 10 S
However, I only want to read in data that has an 'S' or 'M' code.
I tried to subset:
newdata <- subset(table1, code = 'S' |'M')
However I get this issue:
operations are possible only for numeric, logical or complex types
If there are thousands or tens of thousands of records (maybe not for millions), you should just be able to filter after you read in all the data:
> library(tidyverse)
> df %>% filter(code=="S"|code=="M")
# A tibble: 2 x 4
f_name l_name hours_worked code
<fct> <fct> <dbl> <fct>
1 Richard Copton 4.50 M
2 Tina Bar 10.0 S
If you really want to pull in only the rows that meet your condition, try the sqldf package, as in the example here: How do i read only lines that fulfil a condition from a csv into R?
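For instance, a sketch with sqldf::read.csv.sql (this assumes teamwork.txt really is pipe-delimited with a header row; inside the SQL statement the file is referred to as file):
library(sqldf)
# only rows with code 'S' or 'M' ever reach R
table1 <- read.csv.sql("teamwork.txt",
                       sql = "select * from file where code in ('S', 'M')",
                       header = TRUE, sep = "|")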
You can try
cols_g <- table1[which(table1$code == "S" | table1$code == "M"), ]
OR
cols_g <- subset(table1, code=="S" | code=="M")
OR
library(dplyr)
cols_g <- table1 %>% filter(code=="S" | code=="M")
If you want the result attached to table1 rather than kept in a separate object, you can assign the output of any of these three methods to table1$cols_g instead of cols_g.