How to add two specific columns from a colSums table in R?

I made a frequency table with two variables in a data frame using this:
table(df$Variable1, df$Variable2)
The output was this:
       1  2  3 4  5 D   R
  1 5000 21 39 2 10 0 112
  2 1028 11 18 4  8 1  54
  3 1501  6 12 2  3 0  68
  4  355  2  4 0  0 0  23
  5  421  4  4 0  0 0  49
Then I wanted to find the sum of the first two columns so I did this:
colSums(table(df$Variable1, df$Variable2))
The output was this:
   1  2  3 4  5 D   R
8305 44 77 8 21 1 306
Is there a way to find the sum of columns 1 and 2 from the colSums output above? What would the code be? Thanks in advance.
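No answer is recorded here, so the following is only a sketch of one possible approach, not a confirmed solution. It rebuilds the table from the printed counts (the object names tab and col_totals are made up for illustration), then subsets the named vector that colSums() returns before summing:

```r
# Rebuild the frequency table from the counts printed in the question.
# In practice this would simply be: tab <- table(df$Variable1, df$Variable2)
tab <- as.table(matrix(
  c(5000, 21, 39, 2, 10, 0, 112,
    1028, 11, 18, 4,  8, 1,  54,
    1501,  6, 12, 2,  3, 0,  68,
     355,  2,  4, 0,  0, 0,  23,
     421,  4,  4, 0,  0, 0,  49),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), c("1", "2", "3", "4", "5", "D", "R"))
))

# colSums() returns a named numeric vector, so it can be indexed like any vector
col_totals <- colSums(tab)

# Sum of the first two columns, selected by position ...
sum_by_position <- sum(col_totals[1:2])

# ... or by column name
sum_by_name <- sum(col_totals[c("1", "2")])

print(sum_by_position)  # 8305 + 44 = 8349
```

Indexing by name is a little safer than indexing by position if the column order could change.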

Related

Get the average of the values of one column for the values in another

I was not so sure how to ask this question. I am trying to find the average tone when an initiative is mentioned, and additionally when a topic and a goal (or achievement) are mentioned. My data frame (df) has many mentions of 70 initiatives, meaning it has 500+ rows of data but only 70 distinct initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find the mean Tone when an Initiative is mentioned, as well as the mean Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The codes for Tone are: positive (1), neutral (2), negative (3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
which gives this output:
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
Note that for Initiative, the value 0 means "other initiative".
I've also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with this output:
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both cases I do not get an average for Tone but get NAs instead.
I have removed the NAs from the "Tone" column of the df and have also tried removing all the other missing values in the df (only about 30 values were deleted).
I have also recoded the values for Tone:
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot get the average tone for an initiative. Maybe the solution is more obvious than I think, but I have gotten stuck and have no idea how to proceed.
I'd be super grateful for better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you want the average tone for Initiative == 1. You could try the following:
tabmean %>% filter(Initiative == 1) %>% summarise(avg_tone = mean(Tone, na.rm = TRUE))
Note that (1) you have to add na.rm = TRUE to the mean() call inside summarise if you have missing values in the column that you are summarizing, otherwise it will only produce NAs, and (2) you should check that the columns are of type numeric. You can check that with str(tabmean) and, for example, change Tone to numeric with tabmean <- tabmean %>% mutate(Tone = as.numeric(Tone)). In the output you posted, Tone is of type <chr>, which is why mean() returns NA.
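To get one average per initiative rather than filtering a single value, a base R sketch along these lines might work. The rows below are a small subset of the sample data shown in the question, with Initiative and Tone stored as character on the assumption that this is what causes the NA results; avg_tone is a made-up name:

```r
# Small subset of the sample rows, stored as character (assumed cause of the NAs)
tabmean <- data.frame(
  Initiative = c("52", "52", "19", "19", "19", "87"),
  Tone       = c("2",  "2",  "1",  "2",  "2",  "1"),
  stringsAsFactors = FALSE
)

# mean() cannot average character values, so convert Tone to numeric first
tabmean$Tone <- as.numeric(tabmean$Tone)

# One mean per Initiative; the formula interface of aggregate() drops rows
# with missing values by default (na.action = na.omit)
avg_tone <- aggregate(Tone ~ Initiative, data = tabmean, FUN = mean)
print(avg_tone)
```

The same idea extends to several grouping columns, for example Tone ~ Initiative + Topic + Goals, as long as Tone itself is not one of the grouping variables.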

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example;
>losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into being 100 minus the numbers shown in the table? If I can plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
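Note that the line above also transforms Accuracy, since it only skips the first column. If both Divisions and Accuracy are meant to stay as they are (an assumption based on the id columns in the melt() call, and on reading the printed table as Divisions, Accuracy, then one column per size), a sketch that selects the size columns by name could look like this, using the first three rows of the sample data. The names id_cols, size_cols and remaining are made up for illustration:

```r
# First three rows of the sample data; check.names = FALSE keeps the
# numeric-looking column names ("20", "15", ...) unchanged
full_losses_data <- data.frame(
  Divisions = 1:3,
  Accuracy  = c(0, 0, 0),
  `20` = c(0, 0, 0),
  `15` = c(0, 0, 1),
  `10` = c(0, 1, 3),
  `5`  = c(3, 10, 17),
  `2`  = c(25, 39, 48),
  check.names = FALSE
)

# Columns to leave untouched, and the remaining (size) columns to transform
id_cols   <- c("Divisions", "Accuracy")
size_cols <- setdiff(names(full_losses_data), id_cols)

# Replace each loss value with 100 minus that value
remaining <- full_losses_data
remaining[size_cols] <- 100 - full_losses_data[size_cols]
print(remaining)
```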

R: calculate and superimpose on a ggplot graph minute-based totals for an event series

Consider a data frame df with an extract from a web server access log, with two fields (sample below, duration is in msec and to simplify the example, let's ignore the date).
time,duration
18:17:26.552,8
18:17:26.632,10
18:17:26.681,12
18:17:26.733,4
18:17:26.778,5
18:17:26.832,5
18:17:26.889,4
18:17:26.931,3
18:17:26.991,3
18:17:27.040,5
18:17:27.157,4
18:17:27.209,14
18:17:27.249,4
18:17:27.303,4
18:17:27.356,13
18:17:27.408,13
18:17:27.450,3
18:17:27.506,13
18:17:27.546,3
18:17:27.616,4
18:17:27.664,4
18:17:27.718,3
18:17:27.796,10
18:17:27.856,3
18:17:27.909,3
18:17:27.974,3
18:17:28.029,3
qplot(time, duration, data=df) gives me a graph of the duration. I'd like to superimpose a line showing the number of requests for each minute. Ideally, this line would have a single data point per minute, at the :30 sec point. If that's too complicated, an acceptable alternative is a step line with the same value (the count of requests) throughout a minute.
One way is to use trunc(df$time, units=c("mins")), then calculate the count of requests per minute into a new column, then graph it.
I'm asking if there is, perhaps, a more direct way to accomplish the above. Thanks.
The following may be helpful. Create a data frame with steps and plot it:
time duration sec sec2 diffsec2 step30s steps
1 18:17:26.552 8 26.552 552 0 0 0
2 18:17:26.632 10 26.632 632 80 1 1
3 18:17:26.681 12 26.681 681 49 0 0
4 18:17:26.733 4 26.733 733 52 1 1
5 18:17:26.778 5 26.778 778 45 0 0
6 18:17:26.832 5 26.832 832 54 1 1
7 18:17:26.889 4 26.889 889 57 1 2
8 18:17:26.931 3 26.931 931 42 0 0
9 18:17:26.991 3 26.991 991 60 1 1
10 18:17:27.040 5 27.040 040 -951 0 0
11 18:17:27.157 4 27.157 157 117 1 1
12 18:17:27.209 14 27.209 209 52 1 2
13 18:17:27.249 4 27.249 249 40 0 0
14 18:17:27.303 4 27.303 303 54 1 1
15 18:17:27.356 13 27.356 356 53 1 2
16 18:17:27.408 13 27.408 408 52 1 3
17 18:17:27.450 3 27.450 450 42 0 0
18 18:17:27.506 13 27.506 506 56 1 1
19 18:17:27.546 3 27.546 546 40 0 0
20 18:17:27.616 4 27.616 616 70 1 1
21 18:17:27.664 4 27.664 664 48 0 0
22 18:17:27.718 3 27.718 718 54 1 1
23 18:17:27.796 10 27.796 796 78 1 2
24 18:17:27.856 3 27.856 856 60 1 3
25 18:17:27.909 3 27.909 909 53 1 4
26 18:17:27.974 3 27.974 974 65 1 5
27 18:17:28.029 3 28.029 029 -945 0 0
>
> ggplot(ddf)+geom_point(aes(x=time, y=duration))+geom_line(aes(x=time, y=steps, group=1),color='red')
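The per-minute count itself can be computed fairly directly. Here is a minimal base R sketch, assuming time is stored as an "HH:MM:SS.mmm" string. The first two rows come from the sample above and the last three are made up so that the data spans two minutes; per_min is a hypothetical name:

```r
# First two rows are from the sample log; the last three are invented
# purely so that the example covers more than one minute
df <- data.frame(
  time     = c("18:17:26.552", "18:17:27.040", "18:17:59.900",
               "18:18:03.120", "18:18:45.001"),
  duration = c(8, 5, 3, 4, 13),
  stringsAsFactors = FALSE
)

# The minute bucket is the "HH:MM" prefix of each timestamp
df$minute <- substr(df$time, 1, 5)

# Count the requests in each minute
per_min <- as.data.frame(table(minute = df$minute),
                         responseName = "requests",
                         stringsAsFactors = FALSE)

# Place each count at the :30 second point of its minute
per_min$midpoint <- paste0(per_min$minute, ":30.000")
print(per_min)
```

per_min could then be drawn as an extra layer on the existing plot, for example with geom_line() or geom_step() in ggplot2, possibly after rescaling if the counts and durations differ a lot in magnitude.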

How to reverse the order of two indices of a variable in R

I have a dataset that looks like
A T Value   into   T A Value
1 1 32             1 1 32
1 2 33             1 2 55
1 3 34             1 3 96
2 1 55             2 1 33
2 2 56             2 2 56
2 3 57             2 3 97
3 1 96             3 1 34
3 2 97             3 2 57
3 3 98             3 3 98
I want to use reshape (in R) to rearrange the object on the left so that the T index comes in the first column and the A index in the second, giving the object on the right. I don't have the melt or cast functions.
Let df be your data.frame.
df <- df[order(df$T, df$A), c("T", "A", "Value")]
This can be found out easily by googling next time.
Looks like you just want to sort rows and move columns. If this is your sample input
tt<-read.table(text="A T Value
1 1 32
1 2 33
1 3 34
2 1 55
2 2 56
2 3 57
3 1 96
3 2 97
3 3 98", header=T)
you can do
tt[order(tt$T, tt$A), c("T","A","Value")]

summing a range of columns in data frame

I am having trouble summing select columns within a data frame, a basic problem that I've seen numerous similar, but not identical questions/answers for on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestm, foxtail, etc.; columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that holds the number of rows summed.
The result would look like this:
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID)
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
This works ok:
x <- aggregate(my.data[,5:12], by=list(site=my.data$site, moisture=my.data$moisture, shade=my.data$shade), FUN=sum, na.rm=T)
library(dplyr)
my.data %>%
group_by(site) %>%
tally %>%
left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
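Or to do it all in dplyr:
# summarise_each() and funs() are deprecated in current dplyr; across() is the replacement
my.data %>%
group_by(site) %>%
tally() %>%
left_join(my.data, by = "site") %>%
group_by(site, moisture, shade, n) %>%
summarise(across(everything(), sum)) %>%
select(-sampleID)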
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
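Try the following using base R (this assumes sampleID has already been dropped from my.data, so that the species occupy columns 4-11):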
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
outdf[nrow(outdf)+1,1:3] = unlist(strsplit(b,' '))
for(i in 4:11)
outdf[nrow(outdf),i]= sum(my.data[my.data$basic==b,i])
}
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
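For data sets with hundreds of species, a base R sketch that refers to the species columns only by position (5:12 here) and also produces the row count could look like this. It uses the data from the question; the names key_cols, species_cols, sums, counts and result are made up for illustration:

```r
# Data from the question
site     <- c(223, 257, 223, 223, 257, 298, 223, 298, 298, 211)
moisture <- c(7, 7, 7, 7, 7, 8, 7, 8, 8, 5)
shade    <- c(83, 18, 83, 83, 18, 76, 83, 76, 76, 51)
sampleID <- c(158, 163, 222, 107, 106, 166, 188, 186, 262, 114)
bluestm  <- c(3, 4, 6, 3, 0, 0, 1, 1, 1, 0)
foxtail  <- c(0, 2, 0, 4, 0, 1, 1, 0, 3, 0)
crabgr   <- c(0, 0, 2, 0, 33, 0, 2, 1, 2, 0)
johnson  <- c(0, 0, 0, 7, 0, 8, 1, 0, 1, 0)
sedge1   <- c(2, 0, 3, 0, 0, 9, 1, 0, 4, 0)
sedge2   <- c(0, 0, 1, 0, 1, 0, 0, 1, 1, 1)
redoak   <- c(9, 1, 0, 5, 0, 4, 0, 0, 5, 0)
blkoak   <- c(0, 22, 0, 23, 0, 23, 22, 17, 0, 0)
my.data  <- data.frame(site, moisture, shade, sampleID, bluestm, foxtail,
                       crabgr, johnson, sedge1, sedge2, redoak, blkoak)

key_cols     <- c("site", "moisture", "shade")
species_cols <- names(my.data)[5:12]   # a column range, no species names typed out

# Sum every species column within each site/moisture/shade combination
sums <- aggregate(my.data[species_cols], by = my.data[key_cols], FUN = sum)

# Count the rows that went into each combination
counts <- aggregate(list(NumSamples = my.data$site),
                    by = my.data[key_cols], FUN = length)

# Combine counts and sums; merge() sorts the result by the key columns
result <- merge(counts, sums, by = key_cols)
print(result)
```

Only the 5:12 range would need to change for a wider data set, for example 5:ncol(my.data) if the species always run to the last column.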
