How to select a matrix element in R?

Reading the data the following way
data<-read.csv("userStats.csv", sep=",", header=F)
I tried to select an element at the specific position.
The example of the data (first five rows) is the following (V2 is the date and V3 is the day of week):
V1 V2
1 00002781A2ADA816CDB0D138146BD63323CCDAB2 2010-09-04
2 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-04
3 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-07
4 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-08
5 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-17
V3 V4 V5 V6 V7 V8 V9
1 Saturday 2 2 615 1 1 47
2 Saturday 2 2 77 1 1 43
3 Tuesday 1 3 201 1 1 117
4 Wednesday 1 1 44 1 1 74
5 Friday 1 1 3 1 1 18
I tried to divide the 6th column by the 9th column in the first row the following way:
data[1,6]/data[1,9]
but it returned NA with a warning
[1] NA
Warning message:
In Ops.factor(data[1, 6], data[1, 9]) : / not meaningful for factors
Then I tried to select just one element
> data[2,9]
[1] 43
11685 Levels: 0 1 2 3 ... 55311
but I don't know what these Levels are or what causes the error. Does anyone know how to select an element at a specific position data[row, column]?
Thank you!

My favorite tool to check variable class is str().
What you have there is a data frame and at least one of the columns you're trying to work with is a factor. See Dirk's answer on how to change classes of a column.
Command
data[1,6]/data[1,9]
selects the value in the first row of the sixth column and divides it by the value in the first row of the ninth column. Is this what you want? If you want to use the values from the entire columns (and not just the first row), you would write
data[6] / data[9]
or
data[, 6] / data[, 9]
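Both forms divide the columns element by element for data.frames. The difference is in what they return: data[6] gives a one-column data frame, while data[, 6] gives a plain vector.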
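A minimal sketch of the points above, on a made-up data frame (the name toy and its values are assumptions, not the asker's file). It shows what str() reports and how the two column-indexing forms differ in the class of their result.

```r
# Toy data frame (hypothetical values, not the asker's data)
toy <- data.frame(a = c(10, 20, 30), b = c(2, 4, 5))

str(toy)                          # prints the class of every column

single <- toy[1, 1] / toy[1, 2]   # one element divided by another
as_df  <- toy[1] / toy[2]         # single-bracket column index: data frame result
as_vec <- toy[, 1] / toy[, 2]     # row-comma-column index: plain numeric vector

class(as_df)
class(as_vec)
```

If str() reports Factor for a column, arithmetic on it will fail in the way shown in the question.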

The standard modeling data structure in R is a data.frame.
The data.frame objects can hold various types: numeric, character, factor, ...
Now, when reading data via read.csv() et al, you can get bitten by the default value of the stringsAsFactors option. I presume that at least one row in your data had text in that column, so R decides to encode it as a factor and presto! you can no longer do direct mathematical operations on the column.
In short, do summary(data) and/or a sweep of class() over all the columns. Convert as necessary, or set the stringsAsFactors argument to FALSE, or both.
Once your data is numeric, you can divide, slice, dice, ... as you please.
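A small sketch of the conversion step, assuming a column that ended up as a factor because one entry was not a number (the vector f and its values are made up for illustration). Note that since R 4.0.0 the default of stringsAsFactors is FALSE, so such a column would arrive as character rather than factor; the conversion through as.character() is the same idea either way.

```r
# A factor holding numbers plus one stray text entry (toy values)
f <- factor(c("615", "77", "201", "n/a"))

as.numeric(f)   # internal integer codes, NOT the printed values

# Convert through character to recover the printed values; "n/a" becomes NA
num <- suppressWarnings(as.numeric(as.character(f)))
num

# Hypothetical re-read that keeps text columns as character instead of factor:
# data <- read.csv("userStats.csv", header = FALSE, stringsAsFactors = FALSE)
```

After the conversion, an expression such as num[1] / num[2] behaves as ordinary arithmetic.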

Related

Can I select the rows in my data set that have the same value in 2 of the columns?

I have a data set with 40 columns and 2000 rows. The values of 2 columns are important. I want to select the rows that have the same value in these 2 columns.
A small sample of my data looks like this
2 3 4 5 6 3 23 32
4 3 4 1 0 5 6 43
4 4 3 22 1 2 23
Suppose I want to select the rows that have the same value in the first and third columns. So I want the second row to be stored in a new data set.
I take from your comments that you have numbers stored as factors in that dataframe. Factors have different internal values. So when the console output shows the factor level to be 4 it is not necessarily a 4 in the internal representation. In general, two different factors are not compatible with each other except if they have the same level set. To see the 'internal representation' of your first column use as.numeric(df[[1]]).
Now to the solution of your problem. You first have to convert the factors in your columns 1 and 3 (or all columns) into numeric values using the factor levels. Instructions for that can be found here.
## converting factor levels to numeric values
df[[1]] <- as.numeric(levels(df[[1]]))[df[[1]]]
df[[3]] <- as.numeric(levels(df[[3]]))[df[[3]]]
## filter data
df[df[[1]] == df[[3]], ]
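As a quick check of the filter, here is a sketch on a toy data frame built from the first three columns of the sample in the question (the column names V1 to V3 are assumptions). The columns are already numeric here, so no factor conversion is needed.

```r
# First three columns of the question's sample rows
df <- data.frame(V1 = c(2, 4, 4), V2 = c(3, 3, 4), V3 = c(4, 4, 3))

# Keep the rows where column 1 equals column 3
same <- df[df[[1]] == df[[3]], ]
same
```

Only the second row has the same value in both columns, so it is the only one kept.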

Finding values in elements of a list greater than X

I have a list of elements called "find_gaps", below are the first 3 elements of the list:
$`2014-11-01 00:33:18`
1 1 1 1 1 1 1 1 1 118
$`2014-11-01 01:35:58`
1 1 1 1 1 1 1 1 1 116
$`2014-11-01 02:34:28`
1 25 25
I want to find values greater than or equal to 24 in each element, and have the output as a data frame where each column contains rows equal to the number of values greater than 24 for each list element. For example, the first element in "find_gaps" would correspond to a data frame column having only one row (with value 118). I am sure there is a way to do this, I have used the code below but I only get the position/index of the value in each list element greater than 24, and not the value itself:
greater_than_24<-lapply(find_gaps,function(x)which(x>=24))
greater_than_24<-unlist(lapply(find_gaps,function(x) length(which(x>=24))))
> as.data.frame(t(greater_than_24))
V1 V2 V3
1 1 1 2
Alternatively - this will pull off the values greater than 24 in each element of the list:
greater_than_24<-lapply(find_gaps,function(x) x[which(x>=24)])
> as.data.frame(t(greater_than_24))
V1 V2 V3
1 118 116 25, 25
This question already has an accepted answer, and the OP has described that he expects the output in wide form.
Nevertheless, I would like to propose a different approach which returns the result in long form (including the names of the list elements). I hope the OP finds this alternative representation of the result useful.
library(data.table)
data.table(find_gaps, name = names(find_gaps))[
, .(value = unlist(find_gaps)), by = name][value >= 24]
name value
1: 2014-11-01 00:33:18 118
2: 2014-11-01 01:35:58 116
3: 2014-11-01 02:34:28 25
4: 2014-11-01 02:34:28 25
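The long form can also be sketched in base R, without data.table. This is an alternative to the answer above rather than a restatement of it: it filters each list element and then uses stack() to build a two-column data frame. The list is rebuilt here from the three elements shown in the question; the output column names value and name are chosen to match the data.table result.

```r
# The three list elements shown in the question
find_gaps <- list(
  "2014-11-01 00:33:18" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 118),
  "2014-11-01 01:35:58" = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 116),
  "2014-11-01 02:34:28" = c(1, 25, 25)
)

# Keep only the values >= 24 in each element
kept <- lapply(find_gaps, function(x) x[x >= 24])

# stack() turns a named list of vectors into a long data frame
long <- stack(kept)
names(long) <- c("value", "name")
long
```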

Take difference of two columns in R resulting in a new third column

So far I have a data frame that looks like this:
Account Total Mastered Not_Mastered
1 1 NA NA
2 12 2 10
3 4 NA NA
4 51 50 1
The code I have is:
Table$not_mastered = (Table$total - Table$mastered)
My goal is to subtract the 'mastered' column from the 'total' column to result in a third column 'not_mastered' and if there is no value in the 'mastered' column then I want the new column to have the same value as the 'total' column. Shown below.
Account Total Mastered Not_Mastered
1 1 NA 1
2 12 2 10
3 4 NA 4
4 51 50 1
How can I skip over the NA values in the mastered column and rewrite the values from the total column?
We can use replace to change the NA values to 0 and then take the difference
with(df1, Total - replace(Mastered, is.na(Mastered), 0))
#[1] 1 10 4 1
Alternatively, you can handle the NA values explicitly with a loop and an if/else test. Note that in R the test has to be is.na(x), because x == NA always evaluates to NA:
for (i in seq_len(nrow(df1))) {   # look at each row, one at a time
  if (is.na(df1$Mastered[i])) {   # no value in the Mastered column
    df1$Not_Mastered[i] <- df1$Total[i]
  } else {
    df1$Not_Mastered[i] <- df1$Total[i] - df1$Mastered[i]
  }
}
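The same rule can also be written without a loop. A minimal vectorized sketch using ifelse(), on a data frame rebuilt from the question's table (the name df1 follows the first answer; the column names are taken from the printed table):

```r
# Data from the question
df1 <- data.frame(Account  = 1:4,
                  Total    = c(1, 12, 4, 51),
                  Mastered = c(NA, 2, NA, 50))

# Where Mastered is NA keep Total, otherwise subtract
df1$Not_Mastered <- ifelse(is.na(df1$Mastered),
                           df1$Total,
                           df1$Total - df1$Mastered)
df1
```

This stores the result as a column, which is what the question asks for.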

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance from each point on list 1 to every single point on list 2). The first column is a list of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to reshape this matrix into just 3 columns. The first column would hold the 110,357,407 point IDs (like the first column of the matrix), the second would hold the 41 data points (each paired with every point in the first column), and the third would hold the distance between the two points. So it would look something like this
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since the vector of distances has about 4.5 billion elements (41 x 110,357,407). You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
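A rough sketch of that piecewise idea, using the small 5 x 3 example from the question's edit in place of the real data. The column names P1 to P3 and the temporary output file are assumptions. Each pass handles one "pres" column and appends its rows to the same file, so only one piece is built at a time.

```r
# Toy stand-in: first column is the point id, the rest are distances
df <- data.frame(Back = 1:5,
                 P1 = c(3427, 3432, 3486, 3449, 3483),
                 P2 = c(3444, 3486, 3479, 3438, 3486),
                 P3 = c(3451, 3476, 3486, 3484, 3486))

out_file <- tempfile(fileext = ".csv")   # hypothetical output location

# One "pres" column per iteration, appended to the same file
for (j in seq_len(ncol(df) - 1)) {
  piece <- data.frame(Back = df[[1]], Pres = j, Dist = df[[j + 1]])
  write.table(piece, out_file, sep = ",", row.names = FALSE,
              col.names = (j == 1),   # header only for the first piece
              append = (j > 1))       # later pieces are appended
}

result <- read.csv(out_file)
head(result)
```

With the full data, the input itself may not fit in memory either, so it would also have to be read in blocks of rows; that part is not shown here.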

Substituting values in the date field (string) within a dataframe

It must be a very easy task, but I can't find the right line of code for this:
The data frame (df) has several columns (Date is the first one, containing strings) and around 200 rows.
Date V1
1 01/01/2011 5
2 02/01/2011 4
3 03/01/2011 2
...
200 05/09/2011
needs to become this (current year):
Date V1
1 01/01/2013 5
2 02/01/2013 4
3 03/01/2013 2
...
200 05/09/2013
Thanks!
df$Date <- sub('11$','13',df$Date)
should work.
But beware: naming a variable Date is a bad idea because R already has an internal data type with that name.
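A short sketch of the substitution on a few values from the question. The pattern here is slightly stricter than the one above: it anchors on "/2011" rather than just "11", which is an assumption about what the asker wants (only the year 2011 replaced). The last step parses the strings into real dates; the day/month/year order is also an assumption, since the sample does not make it clear.

```r
# A few date strings from the question
dates <- c("01/01/2011", "02/01/2011", "05/09/2011")

# Replace the year only when the string ends in "/2011"
updated <- sub("/2011$", "/2013", dates)
updated

# Optional: convert to the Date class (assumes day/month/year order)
parsed <- as.Date(updated, format = "%d/%m/%Y")
class(parsed)
```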
