Summing a range of columns in a data frame - R
I am having trouble summing selected columns within a data frame. It is a basic problem, and I have seen numerous similar but not identical questions and answers on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc.; columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that counts the number of rows summed.
The result would look like this:
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, so I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which would be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID):
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
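This works OK (using the original data frame, with sampleID still in column 4):
# Sum the species columns (5:12) within each site/moisture/shade combination
x <- aggregate(my.data[, 5:12], by = list(site = my.data$site, moisture = my.data$moisture, shade = my.data$shade), FUN = sum, na.rm = TRUE)
library(dplyr)
# Count the rows per site, then attach the sums
my.data %>%
  group_by(site) %>%
  tally() %>%
  left_join(x, by = "site")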
  site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1  211 1        5    51       0       0      0       0      0      1      0      0
2  223 4        7    83      13       5      4       8      6      1     14     45
3  257 2        7    18       4       2     33       0      0      1      1     22
4  298 3        8    76       2       4      3       9     13      2      9     40
Or, to do it all in dplyr:
# Note: summarise_each() and funs() are deprecated in current dplyr releases
my.data %>%
  group_by(site) %>%
  tally() %>%
  left_join(my.data, by = "site") %>%
  group_by(site, moisture, shade, n) %>%
  summarise_each(funs(sum = sum)) %>%
  select(-sampleID)
  site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1  211        5    51 1       0       0      0       0      0      1      0      0
2  223        7    83 4      13       5      4       8      6      1     14     45
3  257        7    18 2       4       2     33       0      0      1      1     22
4  298        8    76 3       2       4      3       9     13      2      9     40
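With a more recent dplyr (1.0.0 or later, which is an assumption about the installed version), the same table can probably be produced in a single summarise() call, selecting the species by a column range instead of by individual names. A minimal sketch; the names result and NumSamples are chosen here for illustration:

```r
library(dplyr)

# Example data from the question
my.data <- data.frame(
  site     = c(223,257,223,223,257,298,223,298,298,211),
  moisture = c(7,7,7,7,7,8,7,8,8,5),
  shade    = c(83,18,83,83,18,76,83,76,76,51),
  sampleID = c(158,163,222,107,106,166,188,186,262,114),
  bluestm  = c(3,4,6,3,0,0,1,1,1,0),
  foxtail  = c(0,2,0,4,0,1,1,0,3,0),
  crabgr   = c(0,0,2,0,33,0,2,1,2,0),
  johnson  = c(0,0,0,7,0,8,1,0,1,0),
  sedge1   = c(2,0,3,0,0,9,1,0,4,0),
  sedge2   = c(0,0,1,0,1,0,0,1,1,1),
  redoak   = c(9,1,0,5,0,4,0,0,5,0),
  blkoak   = c(0,22,0,23,0,23,22,17,0,0)
)

result <- my.data %>%
  group_by(site, moisture, shade) %>%
  summarise(
    NumSamples = n(),             # number of rows summed in each group
    across(bluestm:blkoak, sum),  # every column from bluestm through blkoak
    .groups = "drop"
  )
result
```

Because sampleID is neither a grouping column nor inside the bluestm:blkoak range, it is dropped without a separate select() step. With 300 species, only the first and last species names in the range would need to change.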
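Try the following, using base R (this assumes sampleID has already been removed, so the species are in columns 4:11):
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
# Build a key that identifies each site/moisture/shade combination
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
  # Start a new output row; convert the key parts back to numbers
  outdf[nrow(outdf)+1, 1:3] = as.numeric(unlist(strsplit(b, ' ')))
  # Sum each species column over the rows that share this key
  for(i in 4:11)
    outdf[nrow(outdf), i] = sum(my.data[my.data$basic==b, i])
}
outdf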
  site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1  223        7    83      13       5      4       8      6      1     14     45
2  257        7    18       4       2     33       0      0      1      1     22
3  298        8    76       2       4      3       9     13      2      9     40
4  211        5    51       0       0      0       0      0      1      0      0
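The same grouping can also be sketched in base R without an explicit loop and with the row count included, by aggregating twice and merging. This is one possible approach, not the method of any answer above; species, grp, sums, counts and result are illustrative names, and the species are assumed to sit in columns 5:12 of the original data frame:

```r
# Example data from the question (sampleID still present in column 4)
my.data <- data.frame(
  site     = c(223,257,223,223,257,298,223,298,298,211),
  moisture = c(7,7,7,7,7,8,7,8,8,5),
  shade    = c(83,18,83,83,18,76,83,76,76,51),
  sampleID = c(158,163,222,107,106,166,188,186,262,114),
  bluestm  = c(3,4,6,3,0,0,1,1,1,0),
  foxtail  = c(0,2,0,4,0,1,1,0,3,0),
  crabgr   = c(0,0,2,0,33,0,2,1,2,0),
  johnson  = c(0,0,0,7,0,8,1,0,1,0),
  sedge1   = c(2,0,3,0,0,9,1,0,4,0),
  sedge2   = c(0,0,1,0,1,0,0,1,1,1),
  redoak   = c(9,1,0,5,0,4,0,0,5,0),
  blkoak   = c(0,22,0,23,0,23,22,17,0,0)
)

species <- 5:12                                  # assumed range of species columns
grp <- my.data[c("site", "moisture", "shade")]   # grouping columns

# Column sums per group
sums <- aggregate(my.data[, species], by = grp, FUN = sum)

# Rows per group: sum a column of ones
counts <- aggregate(data.frame(NumSamples = rep(1L, nrow(my.data))),
                    by = grp, FUN = sum)

# Join on the shared grouping columns and order by site
result <- merge(counts, sums)
result <- result[order(result$site), ]
result
```

For a data set with 300 species, only the species range would need to change.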
Related
How to add two specific columns from a colSums table in r?
I made a frequency table with two variables in a data frame using this:
table(df$Variable1, df$Variable2)
The output was this:
       1    2    3    4    5    D    R
  1 5000   21   39    2   10    0  112
  2 1028   11   18    4    8    1   54
  3 1501    6   12    2    3    0   68
  4  355    2    4    0    0    0   23
  5  421    4    4    0    0    0   49
Then I wanted to find the sum of the first two columns so I did this:
colSums(table(df$Variable1, df$Variable2))
The output was this:
   1    2    3    4    5    D    R
8305   44   77    8   21    1  306
Is there a way to find the sum of columns 1 and 2 from the colSums output above? What would the code be? Thanks in advance.
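One way this could be approached: colSums() returns a named numeric vector, so the two entries can be picked out by name and added. A small sketch, in which the table from the question is rebuilt by hand as a matrix (tab and cs are illustrative names):

```r
# Rebuild the frequency table from the question as a plain matrix
tab <- matrix(
  c(5000, 21, 39, 2, 10, 0, 112,
    1028, 11, 18, 4,  8, 1,  54,
    1501,  6, 12, 2,  3, 0,  68,
     355,  2,  4, 0,  0, 0,  23,
     421,  4,  4, 0,  0, 0,  49),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5), c("1", "2", "3", "4", "5", "D", "R"))
)

cs <- colSums(tab)            # named vector: 8305 44 77 8 21 1 306
total <- sum(cs[c("1", "2")]) # select by name, then add
total                         # 8349
```

Selecting by name ("1", "2") rather than by position avoids surprises if the column order changes.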
Get the average of the values of one column for the values in another
I was not sure how to ask this question. I am trying to find the average tone when an initiative is mentioned, and additionally when a topic and a goal (or achievement) are mentioned. My data frame (df) has many mentions of 70 initiatives, meaning my df has 500+ rows of data but only 70 Initiatives. My data looks like this:
> tabmean
   Initiative Topic Goals Achievements Tone
 1         52    44     2            2    2
 2        294    42     2            2    2
 3        103    31     2            2    2
 4         52    41     2            2    2
 5         87    26     2            1    1
 6         52    87     2            2    2
 7        136    81     2            2    2
 8         19     7     2            2    1
 9         19     4     2            2    2
10          0    63     2            2    2
11          0    25     2            2    2
12         19    51     2            2    2
13         52    51     2            2    2
14        108    94     2            2    1
15         52    89     2            2    2
16        110    37     2            2    2
17        247    25     2            2    2
18         66    95     2            2    2
19         24    49     2            2    2
20         24   110     2            2    2
I want to find the mean Tone when an Initiative is mentioned, as well as the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The codes for Tone are: positive (coded 1), neutral (2), negative (3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>% group_by(Initiative,Topic,Goals,Tone) %>% summarize(averagetone = mean(Tone))
with this output:
GoalMeanTone
# A tibble: 454 x 5
# Groups:   Initiative, Topic, Goals [424]
   Initiative Topic Goals Tone  averagetone
   <chr>      <chr> <chr> <chr>       <dbl>
 1 0          104   2     0              NA
 2 0          105   2     0              NA
 3 0          22    2     0              NA
 4 0          25    2     0              NA
 5 0          29    2     0              NA
 6 0          30    2     1              NA
 7 0          31    1     1              NA
 8 0          42    1     0              NA
 9 0          44    2     0              NA
10 0          44    NA    0              NA
# ... with 444 more rows
Note that for Initiative, the value 0 means "other initiative".
I've also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with this output:
> GoalMeanTone2
   Initiative V1
1           0 NA
2           1 NA
3         101 NA
4         102 NA
5         103 NA
6         104 NA
7         105 NA
8         107 NA
9         108 NA
10        110 NA
Note that in both instances I do not get an average for Tone but instead get NAs. I have removed the NAs from the column "Tone" in the df, and I have also tried removing all the other missing values in the df (only about 30 values were deleted). I have also recoded the values for Tone:
tabmean<-Meantable %>% mutate(Tone=recode(Tone, `1`="1", `2`="0", `3`="-1", `4`="2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than I think, but I am stuck and have no idea how to proceed. I'd be super grateful for better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but say you want the average tone when Initiative is 1. You could try the following:
tabmean %>% filter(Initiative==1) %>% summarise(avg_tone=mean(Tone, na.rm=TRUE))
Note that (1) you have to add na.rm=TRUE to the summarise call if you have missing values in the column that you are summarizing, otherwise it will only produce NAs, and (2) you should check that the columns are of type numeric. You can check that with str(tabmean) and, for example, change Tone to numeric with
tabmean <- tabmean %>% mutate(Tone=as.numeric(Tone))
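The <chr> column types in the tibble output suggest that Tone is stored as text, in which case mean() returns NA. A minimal base R sketch of the per-initiative average under that assumption, using a handful of made-up rows in the same shape as the question's data (avg is an illustrative name):

```r
# Small stand-in for tabmean, with Tone stored as character (assumed)
tabmean <- data.frame(
  Initiative = c("52", "294", "52", "87", "19", "19"),
  Tone       = c("2", "2", "2", "1", "1", "2"),
  stringsAsFactors = FALSE
)

# Convert the text codes to numbers before averaging
tabmean$Tone <- as.numeric(tabmean$Tone)

# Mean Tone for each Initiative; na.rm is passed through to mean()
avg <- aggregate(Tone ~ Initiative, data = tabmean, FUN = mean, na.rm = TRUE)
avg
```

Grouping by Initiative only, rather than also by Tone, matters here: once Tone is part of the grouping, every group contains a single Tone value and the mean carries no information.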
Why is my R code for filtering data producing different results with "fread()" and "ffdf()"?
I have a huge file with 7 million records and 160 variables. I came to know that fread() and read.csv.ffdf() are two ways to handle such big data. But when I try to use dplyr to filter these two data sets, I get different results. Below is a small subset of my data:
sample_data
   AGE AGE_NEONATE AMONTH AWEEKEND
2   18                  5        0
3   32                 11        0
4   67                  7        0
5   37                  6        1
6   57                  5        0
7   50                  6        0
8   59                 12        0
9   44                  9        0
10  40                  9        0
11  27                  3        0
12  59                  8        0
13  44                  7        0
14  81                 10        0
15  59                  6        1
16  32                 10        0
17  90                 12        1
18  69                  7        0
19  62                 11        1
20  85                  6        1
21  43                 10        0
Code1:
sample_data <- fread("/user/sample_data.csv", stringsAsFactors = T)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result1:
  AGE AGE_NEONATE AMONTH AWEEKEND
1  67          NA      7        0
2  81          NA     10        0
3  90          NA     12        1
4  69          NA      7        0
5  85          NA      6        1
Code2:
sample_data <- read.csv.ffdf(file="C:/Users/sample_data.csv", header=F ,fill=T)
header.true <- function(df) {
  names(df) <- as.character(unlist(df[1,]))
  df[-1,]
}
sample_data<-tbl_ffdf(sample_data)
sample_data<-header.true(sample_data)
age_filter<-sample_data%>%filter(!(is.na(AGE)), between(as.numeric(AGE),65 , 95))
Result2:
  AGE AGE_NEONATE AMONTH AWEEKEND
1  81                 10        0
2  90                 12        1
3  85                  6        1
I know that my 1st code is correct and gives me the correct results. What am I doing wrong in the 2nd code?
I haven't really tried running your code, but from what I can see, I suspect the following: In your 2nd code version, you are reading the headers as part of the data. This leads to all the columns being imported as character rather than numeric. In addition, most likely you have default.stringsAsFactors() returning TRUE, meaning that the imported character columns are treated as factors. Now I guess that your between is being applied to factor levels between 65 and 95, rather than to the actual numbers. Since you probably don't have data for every year (age), 67 and 69 are likely mapped to factor levels below 65 (i.e. as.numeric(AGE) will return you the factor levels the numbers map to, and not the numbers as you see them when printing). Try to use stringsAsFactors = FALSE or convert explicitly to character after reading.
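The factor behaviour suspected above can be illustrated in isolation. This is a sketch of the general pitfall with made-up values, not a confirmed diagnosis of the ffdf case:

```r
# Ages stored as a factor, as happens when text columns are read as factors
age <- factor(c("67", "81", "90", "69", "85"))

# as.numeric() on a factor returns the internal level codes, not the ages
codes <- as.numeric(age)                 # 1 3 5 2 4

# Converting to character first recovers the actual numbers
values <- as.numeric(as.character(age))  # 67 81 90 69 85

codes
values
```

A filter such as between(as.numeric(AGE), 65, 95) applied to the codes would therefore compare level positions rather than ages.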
How to get value from upcoming row if condition is met?
I searched Google and SO but could not find an answer to my question. I am trying to get a value from the first upcoming row in which a condition is met. Example:
Pupil participation bonus
2     55            6
2     33            3
2     88            9
2     0             -100
2     44            4
2     66            7
2     0             -33
to
Pupil participation bonus bonusAtNoParti sumBonusTillParticipation=0
2     55            6     -94            6+3+9 = 18
2     33            3     -97            3+9 = 12
2     88            9     -91            9
2     0             -100  0              0
2     44            4     -29            4+7 = 11
2     66            7     -26            7
2     0             -33   0              0
So I need to do this: iterate through the data frame, check the following rows until participation equals 0, take the bonus from that row, add the bonus from the current row, and write the result to bonusAtNoParti. My problem is the "check the following rows until participation equals 0 and take the bonus from that row" part. I know how to iterate through the whole list, but not from the current row onward. I need to apply this to the whole list, where participation values can appear in any order. Does anyone have an idea how to do this?
Edit: I also added another column ("sumBonusTillParticipation=0"; only the sum value is required), which is even harder to do. R is such a hard language to learn =(
You can use which() to get the row numbers where participation is 0.
df <- read.table(text = 'Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33', header = T)
index <- c(0, which(df$participation == 0))
diffs <- diff(index)
df$tp <- rep(df$bonus[index], times = diffs)
df$bonusAtNoParti <- df$bonus + df$tp
df$bonusAtNoParti[index] <- 0
df$tp <- NULL
  Pupil participation bonus bonusAtNoParti
1     2            55     6            -94
2     2            33     3            -97
3     2            88     9            -91
4     2             0  -100              0
5     2            44     4            -29
6     2            66     7            -26
7     2             0   -33              0
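The second column requested in the edit (the sum of bonuses from the current row up to, but not including, the next zero-participation row) is not covered by the code above. One possible sketch uses a reverse cumulative sum within each block of rows; is_zero, block, counted and sumBonus are names introduced here for illustration:

```r
df <- data.frame(
  Pupil         = 2,
  participation = c(55, 33, 88, 0, 44, 66, 0),
  bonus         = c(6, 3, 9, -100, 4, 7, -33)
)

is_zero <- df$participation == 0

# Block id: a new block starts on the row after each zero-participation row
block <- cumsum(c(0, head(is_zero, -1)))

# Zero-participation rows contribute nothing to the running sum
counted <- ifelse(is_zero, 0, df$bonus)

# Within each block, sum from the current row down to the end of the block
df$sumBonus <- ave(counted, block,
                   FUN = function(x) rev(cumsum(rev(x))))
df$sumBonus   # 18 12 9 0 11 7 0
```

Reversing, taking cumsum() and reversing again turns the usual "sum so far" into "sum from here to the end of the block", which matches the 6+3+9, 3+9, 9 pattern in the question.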
transform values in data frame, generate new values as 100 minus current value
I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss. However, I am now thinking that a nice plot would be a degradation curve. So, given the following example:
> losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10  5  2
        1        0  0  0  0  3 25
        2        0  0  0  1 10 39
        3        0  0  1  3 17 48
        4        0  0  1  5 23 55
        5        0  1  3  8 29 60
        6        0  1  4 11 34 64
        7        0  2  5 13 38 67
        8        0  3  7 16 42 70
        9        0  4  9 19 45 72
       10        0  5 11 22 48 74
Is there a way I can easily turn this table into being 100 minus the numbers shown in the table? If I can plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
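If Accuracy is also an identifier, as the id argument of the melt() call suggests (this is an assumption), it may be preferable to exclude both id columns by name. A small sketch using three rows and two size columns copied from the table; value_cols is an illustrative name:

```r
# Reduced copy of the example data: two id columns and two size columns
full_losses_data <- data.frame(
  Divisions = 1:3,
  Accuracy  = 0,
  `5`       = c(3, 10, 17),
  `2`       = c(25, 39, 48),
  check.names = FALSE   # keep the numeric column names as they are
)

fld <- full_losses_data

# Every column except the identifiers
value_cols <- setdiff(names(fld), c("Divisions", "Accuracy"))

# Replace losses with 100 minus losses, leaving the id columns untouched
fld[value_cols] <- 100 - fld[value_cols]
fld
```

Selecting by name keeps the transformation correct even if the id columns are not the first two columns.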