Extract specific observations from a csv file in R

I've imported a csv file using read.csv.
It gives me a data frame with 18k observations of 1 variable, which looks like this:
V1
1 Energies (kJ/mol)
2 Bond Angle Proper Dih. Improper Dih. LJ-14
3 3.12912e+04 4.12307e+03 1.63677e+04 1.25619e+02 1.04394e+04
4 Coulomb-14 LJ (SR) Coulomb (SR) Potential Pressure (bar)
5 9.21339e+04 2.82339e+05 -1.15807e+06 -7.21252e+05 -7.25781e+03
6 Step Time Lambda
7 1 1.00000 0.00000
8 Energies (kJ/mol)
9 Bond Angle Proper Dih. Improper Dih. LJ-14
10 2.71553e+04 4.11858e+03 1.63855e+04 1.22226e+02 1.03903e+04
11 Coulomb-14 LJ (SR) Coulomb (SR) Potential Pressure (bar)
12 9.20926e+04 2.65253e+05 -1.15928e+06 -7.43766e+05 -7.27887e+03
13 Step Time Lambda
14 2 2.00000 0.00000
...
I want to extract the Potential energy into a vector. I've tried grep and readLines in several variations, but nothing works. Does anybody have an idea how to solve this problem?
Thanks! :)

So, is this the right answer (from a former physics major)?
Lines <- readLines(textConnection("1 Energies (kJ/mol)
2 Bond Angle Proper Dih. Improper Dih. LJ-14
3 3.12912e+04 4.12307e+03 1.63677e+04 1.25619e+02 1.04394e+04
4 Coulomb-14 LJ (SR) Coulomb (SR) Potential Pressure (bar)
5 9.21339e+04 2.82339e+05 -1.15807e+06 -7.21252e+05 -7.25781e+03
6 Step Time Lambda
7 1 1.00000 0.00000
8 Energies (kJ/mol)
9 Bond Angle Proper Dih. Improper Dih. LJ-14
10 2.71553e+04 4.11858e+03 1.63855e+04 1.22226e+02 1.03903e+04
11 Coulomb-14 LJ (SR) Coulomb (SR) Potential Pressure (bar)
12 9.20926e+04 2.65253e+05 -1.15928e+06 -7.43766e+05 -7.27887e+03
13 Step Time Lambda
14 2 2.00000 0.00000"))
> grep("Potential", Lines) # identify the lines with "Potential"
[1] 4 11
We need to move to the next line and take the 5th field (the first field parsed is the printed row number, so the 5th field is Potential):
> read.table(text=Lines[ grep("Potential", Lines)+1])[ , 5]
[1] -721252 -743766
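Applied directly to the imported data frame (called dat here as a placeholder, with the raw text in column V1), the same idea looks roughly like this; note that the raw file has no printed row-number column, so Potential is probably the 4th field on each value line rather than the 5th -- check that against your file:
txt <- as.character(dat$V1)              # the raw text lines
idx <- grep("Potential", txt)            # header lines mentioning "Potential"
vals <- read.table(text = txt[idx + 1])  # parse the value lines that follow
potential <- vals[, 4]                   # Potential assumed to be the 4th field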

Related

Is there an R function to help me plot the network connections for a single node?

This is my original dataset. R1, R2 and R3 are word association responses for the cue word. tf and df are the total frequency and document frequency of the cue word, respectively.
[1]: https://i.stack.imgur.com/wpfZy.png [Image shows original data frame]
I have cleaned up a dataset into a nodes list and an edge list. I have over a million rows in both lists. Plotting this as a network graph would take too long, and also be very dense, i.e. not understandable.
[2]: https://i.stack.imgur.com/mfSfN.png [Image shows node-list]
[3]: https://i.stack.imgur.com/l60Eu.png [Image shows edge-list]
I want to be able to make a network graph for the cue words, such that upon entering a cue word, I get a network of words that are either responses to it, or are words that the cue word is a response for.
For example, I want to see all the connections for the word 'money'. Using filter(nword == "money") only shows the node 'money' as an output, but I want all nodes connected to the cue word (in this case, 'money').
[4]: https://i.stack.imgur.com/1bKrr.png [Image shows filter()]
Is there a function or a chunk of code that would help me resolve this issue?
Edge list sample:
from  to
1     1
1     6
1     8
1     17
1     18
1     22
1     23
1     38
1     67
1     80
2     82736
2     88035
2     103428
3     11
3     27
3     45
Node list sample:
node_id  nword   n
1        money   13633
2        food    12338
3        water   12276
4        car     8907
5        music   8351
6        green   7890
7        red     7623
8        love    7406
9        sex     6552
10       happy   6432
11       cold    6333
12       bad     6132
13       sad     5958
14       dog     5940
15       white   5910
16       school  5832
17       fun     5594
18       time    5467
19       black   5233
20       hair    5219
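For what it's worth, here is a minimal sketch of the kind of filtering that pulls out a single cue word's neighbourhood, assuming the node list is a data frame called nodes (columns node_id, nword, n) and the edge list is edges (columns from, to), matching the samples above:
library(dplyr)
library(igraph)

cue_id <- nodes$node_id[nodes$nword == "money"]   # id of the cue word

# keep every edge that touches the cue word, in either direction
ego_edges <- edges %>% filter(from == cue_id | to == cue_id)

# the cue word plus every node connected to it
ego_nodes <- nodes %>% filter(node_id %in% unique(c(ego_edges$from, ego_edges$to)))

# build and plot just this small subgraph
g <- graph_from_data_frame(ego_edges, vertices = ego_nodes, directed = TRUE)
plot(g, vertex.label = V(g)$nword)
This plots only the ego network of 'money' instead of the full million-row graph.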

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot the four curves of a typical Price Sensitivity Meter chart (plot image not shown here).
So far I have accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
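For the bigger-picture step, a minimal sketch of the cumulative shares at exact price points, assuming the mock columns a-d above stand for too cheap, bargain, expensive and too expensive (whether each cut-off should be "at or below" versus "strictly below" a price is a modelling choice):
prices <- sort(unique(unlist(df)))        # every exact price point observed

too.cheap     <- ecdf(df$a)(prices)       # share with "too cheap" price <= each point
not.bargain   <- 1 - ecdf(df$b)(prices)   # inverted cumulative share, as described
not.expensive <- 1 - ecdf(df$c)(prices)
too.expensive <- ecdf(df$d)(prices)

psm <- data.frame(price = prices, too.cheap, not.bargain, not.expensive, too.expensive)
Working from psm, line crossings can then be located between consecutive exact prices rather than within $0.50 buckets.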
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we do rbind(), it creates a matrix, and calling unique() on a matrix dispatches to the unique.matrix method:
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which treats each row as a single element (the default MARGIN is 1) when looking for unique values. Instead, converting 'price' to a vector with as.vector() or c() lets unique() work element-wise:
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we call unique.default directly, it treats the matrix as a plain vector:
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
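If you want the result as a one-column data frame rather than a bare vector, one option (a sketch using base R's stack() to pile the columns vertically first) is:
stacked <- stack(df)   # one 'values' column plus an 'ind' column naming the source column
price <- data.frame(price = sort(unique(stacked$values)))
price                  # the same 11 unique prices as above, as a single column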

In R, how can I find unique pairs of numbers across rows when there are three columns?

I am working with triangular meshes in R. For those not familiar, the PLY format has two main components: a 3 by n matrix of vertex x, y, z coordinates, where n is the number of vertices, and a 3 by m matrix of faces, where each number references one row of the vertex matrix, thus defining the three corners of a triangular face. I am trying to find the mesh boundary edges, which are the "sides" of the triangles that are only referenced once in the faces matrix.
Therefore my question is, how do I find unique pairs of numbers across rows where there are three columns?
face 1 4 6 7
face 2 7 6 8
face 3 9 11 12
face 4 10 9 12
Here line (face) 1 has the edge 4-7 that only appears once, while 6-7 appears twice, as does 9-12.
unique() works across rows, but looks for unique rows, and expects the numbers to be in the same order. Any suggestions?
What you want to do is hash each pair, then make a table of the hashes. You also want (x, y) to hash the same as (y, x).
> data
V1 V2 V3 V4 V5
1 face 1 4 6 7
2 face 2 7 6 8
3 face 3 9 11 12
4 face 4 10 9 12
> e1 <- pmin(data[3], data[4]) + pmax(data[3], data[4])/100
> e2 <- pmin(data[3], data[5]) + pmax(data[3], data[5])/100
> e3 <- pmin(data[4], data[5]) + pmax(data[4], data[5])/100
> table(c(e1, e2, e3, recursive = TRUE))
4.06 4.07 6.07 6.08 7.08 9.1 9.11 9.12 10.12 11.12
1 1 2 1 1 1 1 2 1 1
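One caveat with the numeric hash: dividing by 100 assumes the vertex indices stay below 100 (for example, the pair (1, 234) and the pair (3, 34) would both hash to 3.34). A sketch of a collision-free variant using string keys, assuming the vertex columns are V3 to V5 as above:
edge_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "-")  # order-independent key

keys <- c(edge_key(data$V3, data$V4),
          edge_key(data$V3, data$V5),
          edge_key(data$V4, data$V5))

counts <- table(keys)
names(counts)[counts == 1]   # boundary edges: referenced exactly once
# [1] "10-12" "11-12" "4-6" "4-7" "6-8" "7-8" "9-10" "9-11"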

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407 points, and the values in the rows represent the distances between these two sets of points (the distance of each point on list 1 to every single point on list 2). The first column is a list of point indices (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to reshape this matrix into just 3 columns: the first column would be the point index from the first column of the matrix (1 to 110,357,407), the second would identify which of the 41 points the distance is measured to, and the third would be the distance between those two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between every back point and the first pres point are listed, pres changes to 2, and eventually works its way up to 41.)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But the columns don't all have the same number of rows, so I received an error (and I don't think it would have produced the pres column I need). I tried experimenting with the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with on the forum). I also looked into melting and reshaping, but I was very confused about those functions and couldn't figure out how to implement them appropriately (or whether they are even appropriate for what I need). I would really appreciate any help on this, as I've been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- a dataset with 5 x, y points
pres <- a dataset with 3 x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
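As a sanity check, the same two lines reproduce the desired Back/Pres/Dist layout on the small 5 x 3 example from the question (small is a stand-in name; its first column is the Back index and the remaining columns are the distances to the 3 pres points):
small <- data.frame(Back = 1:5,
                    P1 = c(3427, 3432, 3486, 3449, 3483),
                    P2 = c(3444, 3486, 3479, 3438, 3486),
                    P3 = c(3451, 3476, 3486, 3484, 3486))
d <- unlist(small[-1], use.names = FALSE)
newdf <- cbind(expand.grid(Back = seq_len(nrow(small)),
                           Pres = seq_len(ncol(small) - 1)),
               Dist = d)
newdf   # matches the desired output listed above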
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
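And a rough sketch of that chunked approach (the chunk size and the output file name are placeholders; df is the full data frame with the point index in column 1 and the 41 distances in the remaining columns):
chunk_size <- 1e5   # placeholder; tune to your memory
n <- nrow(df)
for (start in seq(1, n, by = chunk_size)) {
  rows  <- start:min(start + chunk_size - 1, n)
  piece <- df[rows, ]
  out <- cbind(expand.grid(Back = piece[[1]],
                           Pres = seq_len(ncol(piece) - 1)),
               Dist = unlist(piece[-1], use.names = FALSE))
  write.table(out, "distances_long.txt", append = start > 1,
              col.names = start == 1, row.names = FALSE)
}
Each pass converts one slice of the wide matrix into long format and appends it to the output file, so the full long table never has to sit in memory at once.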

Is there a faster way to get percent change?

I have a data frame with around 25,000 records and 10 columns. I am using code to compute each value in one column (NewVal) from the previous value in that column, scaled by a percent change stored in another column (y).
x=c(1:25000)
y=rpois(25000,2)
z=data.frame(x,y)
z[1,'NewVal']=z[1,'x']
So I ran this:
for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+(z$NewVal[i-1]*(z$y[i]/100))}
This takes considerably longer than I expected it to. Granted I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using any of the functions from packages as I would like to know what the formula for creating the values is - or if you will, I am a demanding control freak according to that friendly missive.
I would also like to know how to get a moving average just like rollmean in caTools. Either that or how do I figure out what their formula is? I tried entering rollmean and I think it refers to another function (I am new to R). This should probably be another question - but as that letter said, I don't ever make the right decisions in my life.
The secret in R is to vectorise. The recurrence NewVal[i] = NewVal[i-1] * (1 + y[i]/100) unrolls into NewVal[i] = NewVal[1] * prod(1 + y[2:i]/100), so cumprod can do the heavy lifting:
z$NewVal2 <- x[1] * cumprod(with(z, 1 + c(0, y[-1]/100)))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <- Reduce("*", c(1, 1 + z$y[-1]/100), accumulate = TRUE))
user system elapsed
0.139 0.008 0.148
> head(z)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16
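As for the moving-average side question, here is a minimal sketch of a rolling mean written from scratch (no packages), so the formula is explicit; packages differ in how they align and pad the result, but the core calculation is the same:
# Mean of each window of k consecutive values, via cumulative sums:
# mean(x[i], ..., x[i + k - 1]) = (cs[i + k] - cs[i]) / k
roll_mean <- function(x, k) {
  cs <- cumsum(c(0, x))
  (cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]) / k
}

roll_mean(z$NewVal, 5)   # returns nrow(z) - 4 values, one per full window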
