Coloring dgCMatrix image by factor in R - r

I am trying to color a sparse matrix image according to a grouping factor. I know the solution is related to matrix coloring in the lattice package but I have troubles to handle it.
I have a list of hits on an app list. Every hit is related to a user and a app at a specific time.
- On the y axis are users sorted by first install of the app
Every user then has a new line for his pages hits
- On the x axis is the time
Points are hits
Here is a preview of the data:
library(Matrix)
indexUser indexInstall time
1 1 1 3
2 1 1 17
3 1 1 19
4 1 1 32
5 1 1 81
6 1 1 86
7 1 1 124
8 1 1 231
9 1 1 233
10 1 2 249
11 2 3 4
12 2 3 6
13 2 3 7
14 2 3 15
15 2 3 25
16 2 3 32
17 2 3 45
18 2 3 74
19 2 3 75
20 3 4 36
21 3 4 37
22 3 4 113
23 4 5 69
24 4 5 70
25 4 5 71
I then create a sparse matrix as the full dataset is way larger than that (10000+ x 1000)
sM <- sparseMatrix(i=dat$indexInstall, j=dat$time, x=1)
And show an image of it:
image(sM)
I want to color every lines according to the indexUser column. For example to plot user 1 in blue and all others un red
Thanks in advance

Related

Get the average of the values of one column for the values in another

I was not so sure how to ask this question. i am trying to answer what is the average tone when an initiative is mentioned and additionally when a topic, and a goal( or achievement) are mentioned. My dataframe (df) has many mentions of 70 initiatives (rows). meaning my df has 500+ rows of data, but only 70 Initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find what is the mean or average Tone when an Initiative is mentioned. as well as what is the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The code options for Tone are : positive(coded: 1), neutral(2), negative (coded:3), and both positive and negative(4). Goals and Achievements are coded yes(1) and no(2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
With Solution output :
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
note that for Initiative Value 0 means "other initiative".
and I've also tried this code
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with solution output
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances, I do not get an average for Tone but instead get NA's
I have removed the NAs in the df from the column "Tone" also have tried to remove all the other mission values in the df ( its only about 30 values that i deleted).
and I have also re-coded the values for Tone :
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than i think, but have gotten stuck and have no idea how to proceed or solve this.
i'd be super grateful for a better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you'd want to get the average tone for when initiative=1, you could try the following:
tabmean %>% filter(initiative==1) %>% summarise(avg_tone=mean(tone, na.rm=TRUE)
Note that (1) you have to add na.rm==TRUE to the summarise call if you have missing values in the column that you are summarizing, otherwise it will only produce NA's, and (2) check that the columns are of type numeric (you could check that with str(tabmean) and for example change tone to numeric with tabmean <- tabmean %>% mutate(tone=as.numeric(tone)).

Cumulative function for a specific range of values

I have a table with a column "Age" that has a values from 1 to 10, and a column "Population" that has values specified for each of the "age" values. I want to generate a cumulative function for population such that resultant values start from ages at least 1 and above, 2 and above, and so on. I mean, the resultant array should be (203,180..and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am a R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want loop over the grid date, and sum the population of the grid ids close to the store grid id.
I.e basically SUMIF the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e for store1 it is the sum of TOT_P for Grid fields X=[2-4] and Y=[5-7]
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (copied example table from OP)
df.pop
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119

R code is not creating objects?

I have written some code for a university assignment. The assignment is based on various concrete samples and their tensile strengths. There are 20 types of concrete mixtures (made from four different accelerators, and five different plasticisers). Our job is to do a statistical analysis on this data frame:
TStrength accelerator plasticiser
1 3.417543 1 1
2 2.887113 1 2
3 3.600988 1 3
4 3.702631 1 4
5 3.686944 1 5
6 3.699785 1 1
7 3.112972 1 2
8 3.918160 1 3
9 3.600538 1 4
10 2.748832 1 5
11 3.404498 1 1
12 3.735437 1 2
13 3.347577 1 3
14 3.101556 1 4
15 3.527621 1 5
16 3.856831 1 1
17 3.492118 1 2
18 3.928343 1 3
19 3.511689 1 4
20 3.371985 1 5
21 3.069794 2 1
22 3.168010 2 2
23 3.316657 2 3
24 3.455162 2 4
25 2.818250 2 5
26 4.054507 2 1
27 3.065984 2 2
28 3.201351 2 3
29 3.417554 2 4
30 3.364320 2 5
31 3.218677 2 1
32 2.647151 2 2
33 3.222705 2 3
34 3.145210 2 4
35 3.636642 2 5
36 3.317620 2 1
37 3.645922 2 2
38 2.556071 2 3
39 3.177663 2 4
40 3.014374 2 5
41 3.838183 3 1
42 4.155951 3 2
43 3.886330 3 3
44 3.723898 3 4
45 4.425442 3 5
46 3.738460 3 1
47 3.217834 3 2
48 3.942241 3 3
49 3.699851 3 4
50 3.797089 3 5
51 3.652456 3 1
52 4.851609 3 2
53 3.359099 3 3
54 4.089559 3 4
55 4.282991 3 5
56 3.803784 3 1
57 3.519551 3 2
58 3.935084 3 3
59 3.890324 3 4
60 4.611936 3 5
61 3.343098 4 1
62 3.713952 4 2
63 3.629883 4 3
64 3.082509 4 4
65 3.346548 4 5
66 3.277845 4 1
67 3.509506 4 2
68 3.490567 4 3
69 3.235009 4 4
70 3.970925 4 5
71 3.504646 4 1
72 3.270798 4 2
73 3.547298 4 3
74 3.278489 4 4
75 3.322743 4 5
76 2.975010 4 1
77 3.384996 4 2
78 3.399486 4 3
79 3.703567 4 4
80 3.214973 4 5
My first step was to attempt to find out the means of the Tstrength values for each of the 20 concrete types (there are four types of each unique concrete sample). I am very new to R, and my code is certainly not beautiful, but this is the code I wrote to find the means:
#Setting the correct directory
setwd("C:/Users/Matthew/Desktop/Work/Engineering")
#Creating the data frame object, Concrete.
#Note that this will only work if the file
#s...-CW.dat is in the current working directory
#Therefore for this code to work, CreateData.r must
#be run on the individual computer with the
#given matriculation number, and the file must be saved
#in the specified directory
Concrete<-read.table(file='s...-CW.dat',header=TRUE)
#Since the samples of concrete are made from 4 different accelerators and
#5 different plasticisers there will be 4*5=20 unique combinations from
#which concrete samples can come from (i.e. 1,1; 1,2; 4,5 etc).
# There are four samples of each combination
#The next section of code is used to find the mean of the four samples,
#for each combination (20 total)
#creating a list with Tstrength from all (1,1) combinations
#Then finding average
combo1 = list(Concrete[1,1],Concrete[6,1],Concrete[11,1],Concrete[16,1])
combo1mean = mean(unlist(combo1))
#Repeating for (1,2)
combo2 = list(Concrete[2,1],Concrete[7,1],Concrete[12,1],Concrete[17,1])
combo2mean = mean(unlist(combo2))
#Repeating for (1,3)
combo3 = list(Concrete[3,1],Concrete[8,1],Concrete[13,1],Concrete[18,1])
combo3mean = mean(unlist(combo3))
#Repeating for (1,4)
combo4 = list(Concrete[4,1],Concrete[9,1],Concrete[14,1],Concrete[19,1])
combo4mean = mean(unlist(combo4))
#Repeating for (1,5)
combo5 = list(Concrete[5,1],Concrete[10,1],Concrete[15,1],Concrete[20,1])
combo5mean = mean(unlist(combo5))
#Repeating for (2,1)
combo6 = list(Concrete[21,1],Concrete[26,1],Concrete[31,1],Concrete[36,1])
combo6mean = mean(unlist(combo6))
#Repeating for (2,2)
combo7 = list(Concrete[22,1],Concrete[27,1],Concrete[32,1],Concrete[37,1])
combo7mean = mean(unlist(combo7))
#Repeating for (2,3)
combo8 = list(Concrete[23,1],Concrete[28,1],Concrete[33,1],Concrete[38,1])
combo8mean = mean(unlist(combo8))
#Repeating for (2,4)
combo9 = list(Concrete[24,1],Concrete[29,1],Concrete[34,1],Concrete[39,1])
combo9mean = mean(unlist(combo9))
#Repeating for (2,5)
combo10 = list(Concrete[25,1],Concrete[30,1],Concrete[35,1],Concrete[40,1])
combo10mean = mean(unlist(combo10))
#Repeating for (3,1)
combo11 = list(Concrete[41,1],Concrete[46,1],Concrete[51,1],Concrete[56,1])
combo11mean = mean(unlist(combo11))
#Repeating for (3,2)
combo12 = list(Concrete[42,1],Concrete[47,1],Concrete[52,1],Concrete[57,1])
combo12mean = mean(unlist(combo12))
#Repeating for (3,3)
combo13 = list(Concrete[43,1],Concrete[48,1],Concrete[53,1],Concrete[58,1])
combo13mean = mean(unlist(combo13))
#Repeating for (3,4)
combo14 = list(Concrete[44,1],Concrete[49,1],Concrete[54,1],Concrete[59,1])
combo14mean = mean(unlist(combo14))
#Repeating for (3,5)
combo15 = list(Concrete[45,1],Concrete[50,1],Concrete[55,1],Concrete[60,1])
combo15mean = mean(unlist(combo15))
#Repeating for (4,1)
combo16 = list(Concrete[61,1],Concrete[66,1],Concrete[71,1],Concrete[76,1])
combo16mean = mean(unlist(combo16))
#Repeating for (4,2)
combo17 = list(Concrete[62,1],Concrete[67,1],Concrete[72,1],Concrete[77,1])
combo17mean = mean(unlist(combo17))
#Repeating for (4,3)
combo18 = list(Concrete[63,1],Concrete[68,1],Concrete[73,1],Concrete[78,1])
combo18mean = mean(unlist(combo18))
#Repeating for (4,4)
combo19 = list(Concrete[64,1],Concrete[69,1],Concrete[74,1],Concrete[79,1])
combo19mean = mean(unlist(combo19))
#Repeating for (4,5)
combo20 = list(Concrete[65,1],Concrete[70,1],Concrete[75,1],Concrete[80,1])
combo20mean = mean(unlist(combo20))
A few notes about the code: "s..." is just my matriculation number. I have triple checked that I have not made a mistake here regarding either the file name or the directory with where it is stored. CreataData.r is just a script provided to us the generates the data used to create 'Concrete' based on our matriculation number (so we're not just blindly copying each other I suppose).
The problem I am having with the code is that whenever it runs, the object Concrete is created, as is combo1mean, combo2mean and combo3mean. However, I just cannot figure out why the rest of the objects aren't being created.
I have had no success using running the script in the Rgui. After running the script, it tells I check that Concrete has initialised, and I check to see if the combo4mean and above have initialised too, but they never do. I thought it maybe had to do with running the wrong file, or that I hadn't saved the data properly, but the script definitely contains all the code, and I created a new file to see if that would work, but unfortunately it didn't. Also, I have read an introduction to R by W.N. Venables, D.M. Smith and the R Core Team, but nothing there has helped me figure this out.
PS I am not doing this as an easy way out of homework. I have genuinely tried to figure out what is going wrong but I cannot seem to find the problem. I also apologise if the question is inaccurate in anyway, or if I have had misunderstandings, I am very new to R and am trying my best to learn it! Cheers in advance.
EDIT: Just in case anyone is curious, I managed to get the exact same code to work on a different computer, starting from an empty workspace. I'm still not very sure why it didn't work on the first computer, but thanks 42 for the code suggestions.
Adding code that should bypass issues related to reading a text file. This shouls succeed on any R installation:
Concrete <- read.table(text="TStrength accelerator plasticiser
1 3.417543 1 1
2 2.887113 1 2
3 3.600988 1 3
4 3.702631 1 4
5 3.686944 1 5
6 3.699785 1 1
7 3.112972 1 2
8 3.918160 1 3
9 3.600538 1 4
10 2.748832 1 5
11 3.404498 1 1
12 3.735437 1 2
13 3.347577 1 3
14 3.101556 1 4
15 3.527621 1 5
16 3.856831 1 1
17 3.492118 1 2
18 3.928343 1 3
19 3.511689 1 4
20 3.371985 1 5
21 3.069794 2 1
22 3.168010 2 2
23 3.316657 2 3
24 3.455162 2 4
25 2.818250 2 5
26 4.054507 2 1
27 3.065984 2 2
28 3.201351 2 3
29 3.417554 2 4
30 3.364320 2 5
31 3.218677 2 1
32 2.647151 2 2
33 3.222705 2 3
34 3.145210 2 4
35 3.636642 2 5
36 3.317620 2 1
37 3.645922 2 2
38 2.556071 2 3
39 3.177663 2 4
40 3.014374 2 5
41 3.838183 3 1
42 4.155951 3 2
43 3.886330 3 3
44 3.723898 3 4
45 4.425442 3 5
46 3.738460 3 1
47 3.217834 3 2
48 3.942241 3 3
49 3.699851 3 4
50 3.797089 3 5
51 3.652456 3 1
52 4.851609 3 2
53 3.359099 3 3
54 4.089559 3 4
55 4.282991 3 5
56 3.803784 3 1
57 3.519551 3 2
58 3.935084 3 3
59 3.890324 3 4
60 4.611936 3 5
61 3.343098 4 1
62 3.713952 4 2
63 3.629883 4 3
64 3.082509 4 4
65 3.346548 4 5
66 3.277845 4 1
67 3.509506 4 2
68 3.490567 4 3
69 3.235009 4 4
70 3.970925 4 5
71 3.504646 4 1
72 3.270798 4 2
73 3.547298 4 3
74 3.278489 4 4
75 3.322743 4 5
76 2.975010 4 1
77 3.384996 4 2
78 3.399486 4 3
79 3.703567 4 4
80 3.214973 4 5", header=TRUE)
This probably does what you are attempting with about 1/10th (or less) code (and more importantly no errors):
> means.by.type <- with( Concrete, tapply(TStrength,
list( acc=accelerator, plas=plasticiser),
FUN=mean))
> means.by.type
plas
acc 1 2 3 4 5
1 3.594664 3.306910 3.698767 3.479103 3.333845
2 3.415150 3.131767 3.074196 3.298897 3.208397
3 3.758221 3.936236 3.780689 3.850908 4.279364
4 3.275150 3.469813 3.516808 3.324893 3.463797
Importantly, you forgot to offer str or dput on Concrete, so cannot really tell whether you problem is data-prep or coding.

extract a column based on other two column

ID MON in out
2 1 23 12
3 1 23 12
7 1 33 22
1 2 22 11
2 2 111 100
1 3 21 10
2 3 22 11
2 4 111 100
7 4 21 10
2 5 31 20
7 2046 41 30
I have a large data set in this format. I want to extract column four for the value of column 1==2 and column 2 smaller then 5.
It's basic R.
df[,4][df[,1]==2 & df[,2]<5]

Resources