Hi,
over the last few days I have been wrestling with a small/big problem.
I have a transaction dataset with 1 million rows and two columns (client ID and product ID), and I want to transform it into a binary matrix.
I used the reshape and spread functions, but in both cases the job ate my 64 MB of RAM and RStudio/R went down.
Because I only use one CPU, the process also takes a lot of time.
My question is: what is the next step forward in this transition from small to big data? How can I use more CPUs?
I have searched and found a couple of solutions, but I need an expert opinion:
1 - Using SparkR?
2 - An H2O.ai solution? http://h2o.ai/product/enterprise-support/
3 - Revolution Analytics? http://www.revolutionanalytics.com/big-data
4 - Going to the cloud, e.g. Microsoft Azure?
If needed, I can use a virtual machine with a lot of cores, but I need to know the smoothest way to make this transition.
My specific problem
I have this data.frame (but with 1 million rows):
Sell <- data.frame(UserId = c(1, 1, 1, 2, 2, 3, 4), Code = c(111, 12, 333, 12, 111, 2, 3))
and I did:
library(tidyr)  # spread() comes from tidyr
Sell[, 3] <- 1
test <- spread(Sell, Code, V3)
This works with a small data set, but with 1 million rows it takes a long time (12 hours) and goes down because my maximum RAM is 64 MB. Any suggestions?
You don't say what you want to do with the result, but the most efficient way to create such a matrix is as a sparse matrix. Your spread result is a dense matrix-like object that wastes a lot of RAM on all of those NA values:
test
# UserId 2 3 12 111 333
#1 1 NA NA 1 1 1
#2 2 NA NA 1 1 NA
#3 3 1 NA NA NA NA
#4 4 NA 1 NA NA NA
You can avoid this with a sparse matrix, which internally is still basically a long-format structure, but has methods for matrix operations.
library(Matrix)
Sell[] <- lapply(Sell, factor)
test1 <- sparseMatrix(i = as.integer(Sell$UserId),
                      j = as.integer(Sell$Code),
                      x = rep(1, nrow(Sell)),
                      dimnames = list(levels(Sell$UserId),
                                      levels(Sell$Code)))
test1
#4 x 5 sparse Matrix of class "dgCMatrix"
# 2 3 12 111 333
#1 . . 1 1 1
#2 . . 1 1 .
#3 1 . . . .
#4 . 1 . . .
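Because test1 supports the usual matrix methods, you can work with it directly; for example (a quick check on the toy data above):
colSums(test1)  # number of clients per product
#  2   3  12 111 333
#  1   1   2   2   1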
You would need even less RAM with a logical sparse matrix:
test2 <- sparseMatrix(i = as.integer(Sell$UserId),
                      j = as.integer(Sell$Code),
                      x = rep(TRUE, nrow(Sell)),
                      dimnames = list(levels(Sell$UserId),
                                      levels(Sell$Code)))
test2
#4 x 5 sparse Matrix of class "lgCMatrix"
# 2 3 12 111 333
#1 . . | | |
#2 . . | | .
#3 | . . . .
#4 . | . . .
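To see the difference, compare the two objects' footprints (a minimal sketch; on this toy example the object overhead dominates, but on 1 million rows the gap matters):
object.size(test1)  # numeric x slot: 8 bytes per stored value
object.size(test2)  # logical x slot: 4 bytes per stored value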
I'm not sure this is a coding question...BUT...
The new Community Preview of SQL Server 2016 has R built into the server, and you can download the preview to try it here: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016
Doing this brings your R code to your data and runs it on top of the SQL engine, allowing for the same sort of scalability you get built in with SQL.
Or you can stand up a VM in Azure by going to the new portal, selecting "New" > "Virtual Machine", and searching for "SQL".
Background: I have a survey attached to an Excel sheet, and at times a response gets duplicated due to user interaction. The duplication appears right beneath the original response. I would like R to delete the duplicates that occur right beneath the original response while keeping the original. Is there a way to target only the duplicated responses right beneath the original one?
If my dataframe looks like this:
Area Year Course Tested Grade
1 Git 1 Material Y A
2 Ort 3 Fabric Y B
3 Pinst 2 Pattern N NA
4 Coker 1 Fashion Y B+
5 Coker 1 Fashion Y B+
6 South 4 Business N NA
This is what I would want:
Area Year Course Tested Grade
1 Git 1 Material Y A
2 Ort 3 Fabric Y B
3 Pinst 2 Pattern N NA
4 Coker 1 Fashion Y B+
5 South 4 Business N NA
Thank you in advance
Assuming you want to delete duplicates only if they occur in consecutive rows, and to keep them if they occur elsewhere, you can use rleidv from data.table along with duplicated:
df[!duplicated(data.table::rleidv(df)),]
# Area Year Course Tested Grade
#1 Git 1 Material Y A
#2 Ort 3 Fabric Y B
#3 Pinst 2 Pattern N <NA>
#4 Coker 1 Fashion Y B+
#6 South 4 Business N <NA>
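To see why this works, inspect the intermediate run-length ids (a quick check, assuming df holds the example data from the question):
data.table::rleidv(df)
#[1] 1 2 3 4 4 5
Rows 4 and 5 form a single run, so duplicated() flags the second of them and it is dropped.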
I'm looking to do a SUMPRODUCT in R as we do in Excel.
It's a little challenging, as I have to apply some logical conditions along the way.
The Excel formula looks like this:
SUMPRODUCT(--(ID=A2),--(INDIRECT(A1)<>"-"),INDIRECT(B1),C1)
Here ID, A1, and B1 are named ranges on another sheet of the same workbook.
ID $ Quantity
1 23 34
2 4 55
3 NA 6
4 6 45
5 7 NA
6 8 NA
I want logical operators because some values are NA and I don't want to take them into consideration. I want this process to be automated without much manual work.
I've done this to some extent using dplyr, but it's not giving satisfactory results.
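For reference, a minimal base-R sketch of an NA-aware SUMPRODUCT over the table above, assuming it is read into a data frame df with the $ column renamed to Dollars (the names here are illustrative):
df <- data.frame(ID = 1:6,
                 Dollars = c(23, 4, NA, 6, 7, 8),
                 Quantity = c(34, 55, 6, 45, NA, NA))
# multiply pairwise and drop any product involving an NA
sum(df$Dollars * df$Quantity, na.rm = TRUE)
#[1] 1272
# with an extra logical condition, mimicking Excel's --(ID=A2) coercion
sum((df$ID > 1) * df$Dollars * df$Quantity, na.rm = TRUE)
#[1] 490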
I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid specifying each subset individually. I think subset is probably not flexible enough to take a list of names (or at least not to my current knowledge of R, which is growing, but still in its infancy). Is there another command I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
I would use the plyr package.
library(plyr)
ll <- dlply(df,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one-shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but which I find helpful for understanding what's going on), try this:
# create vector with site names, assuming reefdata$site is a factor
sites <- as.character( unique( reefdata$site ) )
# create empty list to take dive data per site
dives <- list( NULL )
# collect data per site into the list
for( i in 1:length( sites ) )
{
# subset
dive <- reefdata[ reefdata$site == sites[ i ] , ]
# add resulting data.frame to the list
dives[[ i ]] <- dive
# name the list element
names( dives )[ i ] <- sites[ i ]
}
I am trying to do some sentiment analysis on Twitter data. I have a dictionary (afinn_list) which looks something like this:
good 5
bad -5
awesome 6
I have been able to generate a variable which contains the location of each matched word. Now I want to generate a score variable which will contain the corresponding score for these matches. I am having a hard time coming up with the for-loop logic.
class(afinn_list)
[1] "data.frame"
vPosMatches <- match(words, afinn_list$word)
vPosMatches
[1] NA NA NA NA 1104 NA NA NA NA NA NA NA NA NA NA NA NA 1836 NA
I am sorry if the question is too naive. I am just trying to learn sentiment analysis using R.
Sentiment analysis is a complex task. Assuming you have cleaned up your data from Twitter and stored it as one word per cell, I guess what you are lacking now is a way to score your cleaned-up words against your scoring "dictionary" afinn_list.
Assuming that your afinn_list looks like this:
dictionary <- data.frame(grade = c('bad', 'not good', 'ok', 'good', 'very good'), score = 1:5)
# grade score
1 bad 1
2 not good 2
3 ok 3
4 good 4
5 very good 5
and your mock_data (cleaned-up data from Twitter) is
mock_data<-data.frame(data=rep(x=c('good','bad','rubbish','hello','very good'),10))
# data
1 good
2 bad
3 rubbish
4 hello
5 very good
6 good
You will then do a merge between the two data frames. In the SQL world this would be a left outer join. In R it is implemented with the function merge, providing the columns you wish to join by and all.x = TRUE.
Hence your code will look like this:
merge(mock_data, dictionary, by.x = 'data', by.y = 'grade', all.x = TRUE)
I hope this answers your question.
Cheers
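As an aside, the vPosMatches vector from the question can be turned into scores directly, with no for loop (a sketch, assuming afinn_list has columns word and score):
scores <- afinn_list$score[vPosMatches]  # NA wherever a word had no match
total <- sum(scores, na.rm = TRUE)       # overall sentiment score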
I am quite new to SQLite and have a dilemma about database design. Suppose we have a number of matrices (of various sizes) that are going to be stored in a table. We can further assume that no matrix is sparse.
Let's say we have:
A = [[1, 4, 5],
     [8, 1, 4],
     [1, 1, 3]]
B = [['what', 'a', 'good', 'day'],
     ['for', 'a', 'walk', 'outside']]
C = [['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
     ['FFF', 'GGG', 'HHH', 'III', 'JJJ'],
     ['KKK', 'LLL', 'MMM', 'NNN', 'OOO']]
And D, which is [N x M].
When we create the table we do not know all the sizes the matrices will have, and I do not think it would be good to alter the table afterwards. What would be a recommended way to store the matrices so they can be retrieved efficiently? I wish to query out a matrix row by row.
I am thinking of transforming each matrix into a column vector that somehow ends up in a table like this:
CREATE TABLE mat(id INT,
                 row INT,
                 col INT,
                 val TEXT)
How can I get them back row by row, with an SQLite query, so that matrix A comes out like this?
[1, 4, 5]
[8, 1, 4]
[1, 1, 3]
Ideas? Or could someone kindly refer me to any similar problems?
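One way to read matrix A back row by row from the mat table sketched above (a sketch; the row number would be bound per fetch from application code):
SELECT val
FROM mat
WHERE id = 1 AND row = ?
ORDER BY col;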
---------------------- UPDATE ----------------------
Okay, my question was not clear enough. The above is probably how I will have to arrange the data in my database. I hope you can help me find a way to organize it.
Suppose we have some sets of data:
Compilation User BogoMips
1 Andrew 1.04
1 Klaus 1.78
1 James 1.99
1 David 2.09
. . .
. . .
1 Alex 4.71
Compilation Time Temperature Colour
2 10:20 10 Blue
2 10:28 21 Green
2 10:42 25 Red
. . . .
. . . .
2 18:16 16 Green
Compilation Colour Distance
3 Blue 4
3 Green 9
. . .
. . .
3 Yellow 12
...And there will be many more sets of data with different numbers of columns and new headers. Some header names will recur in other sets. In advance, we have no idea what kinds of sets need to be stored. Every set has a common header, 'compilation', that binds them together.
How would you structure the data in a database?
I find it hard to believe that creating a new table for each set is a good solution, or is it?
My idea is to have two tables, headers and data.
CREATE TABLE headers (id INT,
                      header TEXT)
CREATE TABLE data (id INT,
                   compilation INT,
                   fk_header_id INT REFERENCES headers,
                   row INT,
                   col INT,
                   value TEXT)
So the populated tables look like this:
SELECT * FROM headers;
id header
------------
1 User
2 BogoMips
3 Time
4 Temperature
5 Colour
6 Distance
SELECT * FROM data;
id compilation fk_header_id row col value
----------------------------------------------------
1 1 1 1 1 Andrew
2 1 2 1 2 1.04
3 1 1 2 1 Klaus
4 1 2 2 2 1.78
. . . . . .
. 2 3 1 1 10:20
. 2 4 1 2 10
. 2 5 1 3 Blue
. 2 3 2 1 10:28
. 2 4 2 2 21
. 2 5 2 3 Green
. . . . . .
. 3 5 1 1 Blue
. 3 6 1 2 4
. 3 5 2 1 Green
. 3 6 2 2 9
. . . . . .
.
and so on
The problem is that I don't know how to query the datasets back out of SQLite. Does anyone (Tony?) have an idea?
You'd need a pivot / crosstab query (or its join equivalent) to get the data out, e.g.:
select c1.value as col1, c2.value as col2, c3.value as col3
from data c1
inner join data c2 on c2.col = 2 and c2.compilation = c1.compilation and c2.row = c1.row
inner join data c3 on c3.col = 3 and c3.compilation = c1.compilation and c3.row = c1.row
where c1.compilation = 1 and c1.col = 1
order by c1.row
As you can see, this is less than fun. In particular, with the above you'd have to know the number of columns in advance. Crosstab or pivot would relieve you of that in the SQL itself, but you'd still have to mess about to read the data from the query result.
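For completeness, a sketch of sidestepping the fixed column count with SQLite's group_concat (note that the concatenation order is not formally guaranteed by SQLite, so treat this as a convenience rather than a contract):
select row, group_concat(value, '|') as row_values
from (select * from data where compilation = 1 order by row, col)
group by row;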
I haven't seen anything in your question that indicates a need to extract a row or a column from a matrix, never mind a single cell, from the db.
My Table would start as simple as
Compilation, Description, Matrix
Matrix would be some sort of serialisation of a matrix object: binary, XML, or even a simple string, e.g. 1,2,3|4,5,6|7,8,9.
If this was all I needed to store, I'd be looking at a NoSQL variant.