issue summing columns


I have a very large dataset and I'm trying to get the column sums. The variables are binary (0s and 1s).
When I run this for loop
for (i in 7:39) {
  agegroup1[53640, i] <- sum(agegroup1[, i])
}
the loop runs, but the result is NA for every column except the first. When I print the values I see only 0s and 1s, and checking the class returns "integer", yet the sums still come back as NA.
Any advice?

cs <- colSums(agegroup1[, 7:39])
will give you the vector of column sums without looping (at the R level).
If you have any missing values (NAs) in agegroup1[, 7:39] then you may want to add na.rm = TRUE to the colSums() call (or even your sum() call).
You don't say what agegroup1 is or how many rows it has etc, but to finalise what your loop is doing, you then need
agegroup1[53640, 7:39] <- cs
What was in agegroup1[53640, ] before you started adding the column sums? NA? If so that would explain some behaviour.
We do really need more detail though...
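A minimal sketch of the behaviour discussed above, using a small made-up data frame (the name toy and its values are assumptions, not the asker's data). It shows that a single NA makes sum() return NA unless na.rm = TRUE is set, and that the totals can be written into a new row in one assignment:

```r
# Hypothetical stand-in for agegroup1: three binary columns, one with an NA
toy <- data.frame(a = c(1L, 0L, 1L),
                  b = c(1L, NA, 1L),
                  c = c(0L, 0L, 1L))

sum(toy$b)                   # NA, because one value is missing
sum(toy$b, na.rm = TRUE)     # 2

# Column totals for all three columns, ignoring missing values
cs <- colSums(toy, na.rm = TRUE)

# Append the totals as a new last row
toy[nrow(toy) + 1, ] <- cs
```

If only the first column of the real data is free of NAs, this would be consistent with the symptom in the question, though that is a guess without seeing the data.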

@Gavin Simpson provided a workable solution, but alternatively you could use apply, which applies a function over the row or column margin of a matrix.
x <- cbind(x1=1, x2=c(1:8), y=runif(8))
# If you wanted to sum the rows of columns 2 and 3
apply(x[,2:3], 1, sum, na.rm=TRUE)
# If you want to sum the columns of columns 2 and 3
apply(x[,2:3], 2, sum, na.rm=TRUE)

Related

Summing over all previous rows in large column efficiently

I have a large data set (>100,000 rows) and would like to create a new column that sums all previous values of another column.
For a simulated data set test.data with 100,000 rows and 2 columns, I create the new vector that sums the contents of column 2 with:
sapply(1:100000, function(x) sum(test.data[1:x[1],2]))
I append this vector to test.data later with cbind(). This is too slow, however. Is there a faster way to do this, or a way to reference the vector that sapply is building from inside sapply, so that I can just update the cumulative sum instead of redoing the whole calculation?

Per my comment above it'll be faster if you do a direct assignment and use cumsum instead of sapply (cumsum was specifically built for what you want to do).
This should work:
test.data$sum <- cumsum(test.data[, 2])
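As a quick check on a small made-up data set (the values below are assumptions for illustration), cumsum gives the same running totals as the sapply approach from the question while making only one pass over the column:

```r
# Hypothetical small stand-in for test.data
test.data <- data.frame(id = 1:6, value = c(3, 1, 4, 1, 5, 9))

# Approach from the question: re-sums rows 1..x for every x
slow <- sapply(seq_len(nrow(test.data)),
               function(x) sum(test.data[1:x, 2]))

# cumsum() computes the running total directly
test.data$sum <- cumsum(test.data[, 2])

# Both give the same running totals: 3 4 8 9 14 23
all.equal(slow, test.data$sum)
```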

Check if a column has more than one value

I have a data frame on which I only want to run a function if certain columns (say, 4 of its 11 columns) each contain more than one distinct value (e.g. they are not all 2).
Is there any specific function to find this out or would I have to loop through each of the columns and check?

We can use sapply to loop over the columns, get the unique elements in each column, check whether the length is greater than 1. It gives a logical vector which can be used for subsetting the dataset if needed.
i1 <- sapply(df1, function(x) length(unique(x)) >1)
df1[i1]
Another option for subsetting the columns is Filter. This relies on var() being zero for a constant column, so it only applies to numeric columns:
Filter(var, df1)

For each column, run length(unique(x)). This returns the number of unique values in that column. If you provide more information, this can be wrapped in a function that decides whether or not to run based on the results of length(unique(x)).
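One way such a wrapper might look, as a sketch only: the helper name all_vary, the example data, and the placeholder analysis (colMeans) are all assumptions made for illustration.

```r
# Hypothetical helper: TRUE only if every named column has
# more than one distinct value
all_vary <- function(df, cols) {
  all(vapply(df[cols], function(x) length(unique(x)) > 1, logical(1)))
}

# Made-up example data: column "b" is constant
df1 <- data.frame(a = c(1, 2, 2), b = c(2, 2, 2), c = c(5, 6, 7))

all_vary(df1, c("a", "c"))   # TRUE
all_vary(df1, c("a", "b"))   # FALSE

# Run the (placeholder) analysis only when the check passes
if (all_vary(df1, c("a", "c"))) {
  result <- colMeans(df1[c("a", "c")])
}
```

vapply is used here instead of sapply so that the result is guaranteed to be a logical vector even for unusual inputs.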

Omitting NA in specific rows when analyzing data from 2 columns of a very large dataframe

I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large data frame with several columns (up to 40) and rows (up to about 200). I want to use two columns at a time to do simple stats (wilcox.test, boxplot, etc.): one column holds a continuous variable (V1), while the other holds a binary variable (V2; 0 or 1) that splits the data into 2 groups. I want to repeat this for the same continuous variable with different, unrelated binary variables in place of V2. I organized the data in Excel, saved it as CSV, and am using RStudio.
All these columns have interspersed NA values, and when I use na.omit it removes every row in which any NA is present, which throws away a great deal of data. Is there a simple way to drop only the rows that are NA in the two columns being analyzed? I have seen some answers to similar topics, but none seems to be quite what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!

If I understand correctly, you want to apply a function to one pair of columns at a time:
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
where the Vi have no missing values. I would do something like this:
## use complete.cases to keep only the rows with no missing values
## for the selected pair
apply_clean <- function(x, y) {
  ok <- complete.cases(x, y)
  wilcox.test(x[ok], y[ok])
}
## apply this function to every column except the continuous one
lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
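To make the difference between listwise and pairwise deletion concrete, here is a small sketch on made-up data (the values and the helper name test_by_group are assumptions). Unlike the snippet above, this variant splits V1 by the two levels of the binary column before testing, which may be closer to what the question describes:

```r
# Made-up data: V1 is continuous, V2 and V3 are unrelated binary groupings
dat <- data.frame(
  V1 = c(2.1, 3.4, NA, 5.0, 4.2, 1.8, 3.9, 2.7),
  V2 = c(0, 1, 0, NA, 1, 0, 1, 0),
  V3 = c(1, NA, 0, 1, NA, 0, 1, 0)
)

nrow(na.omit(dat))                      # 4 rows survive listwise deletion
sum(complete.cases(dat$V1, dat$V2))     # 6 rows usable for V1 vs V2
sum(complete.cases(dat$V1, dat$V3))     # 5 rows usable for V1 vs V3

# Hypothetical helper: compare V1 between the two groups of one binary column,
# dropping only the rows that are NA in that particular pair
test_by_group <- function(g, y) {
  ok <- complete.cases(g, y)
  wilcox.test(y[ok & g == 0], y[ok & g == 1], exact = FALSE)
}

results <- lapply(dat[c("V2", "V3")], test_by_group, y = dat$V1)
```

Setting exact = FALSE requests the normal approximation, which also applies if the real data contain ties.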

You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test whether each value in col1 is NA. The ! negates the test, so only the rows where col1 is not NA are kept.

Select columns with specific value in R

I have the following problem within R:
I'm working with a huge matrix. Some of the columns contain the value 'zero', which leads to problems during my further work.
Hence, I want to identify the columns, which contain at least one value of 'zero'.
Any ideas how to do it?

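If you have a big matrix, this will probably be faster than an apply solution. It keeps only the columns that contain no zeros; colSums(mat == 0) > 0 is TRUE for the columns that contain at least one zero:
mat[, colSums(mat == 0) == 0]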

Let's say your matrix is called x:
x = matrix(runif(300), nrow=10)
To get a logical vector marking the columns that have at least one zero (wrap it in which() to get the column indices):
ix = apply(x, MARGIN=2, function(col){any(col==0)})
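A small worked example may help here, since runif is unlikely to produce any exact zeros. The matrix below is made up for illustration; it uses colSums for the test and then splits the columns both ways:

```r
# Small made-up matrix (filled column by column); columns 2 and 4 contain a zero
mat <- matrix(c(1, 2, 3,
                4, 0, 6,
                7, 8, 9,
                0, 0, 5), nrow = 3)

has_zero <- colSums(mat == 0) > 0   # FALSE TRUE FALSE TRUE
which(has_zero)                     # 2 4

# drop = FALSE keeps the result a matrix even if only one column is selected
with_zero    <- mat[, has_zero, drop = FALSE]
without_zero <- mat[, !has_zero, drop = FALSE]
```

If the matrix may contain NAs, colSums(mat == 0, na.rm = TRUE) avoids NA results in the test.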

Two data formatting questions for R

I have two questions about R, both pretty simple I believe.
I would like to create an IF statement that will assign an NA value to certain rows in a column. I have tried the following command:
a[a[,21]==0,5:10] <-NA
the error says:
Error in [<-.data.frame(tmp, a[, 21] == 0, 5:20, value = NA) : missing values are not allowed in subscripted assignments of data frames
Essentially that code is supposed to find every row with a 0 in column 21 and replace that row's values in columns 5 to 10 with NA. There are NAs in column 21 already, but I am not sure whether that matters.
I am not sure how to craft this next function at all. I need to manipulate data that contains positive and negative controls. However, when I manipulate the data, I don't want the positive and negative control values to be a part of the manipulation, but I want the positive and negative controls to remain in the columns because I have to use them later. Is there any way to temporarily ignore these values so they aren't included in the manipulation?
Here is some sample data:
L = c(2,1,4,3,1,4,2,4,5,1)
R = c(2,4,5,1,"Neg",2,"",1,2,1)
T = c(2,1,4,2,"CTRL",2,"PCTRL",2,1,4)
test <- data.frame(L=L,R=R,T=T)
I would like to be able to temporarily ignore these rows based on the characters "Neg" "CTRL"/"" "PCTRL" rather than the position of them in the data frame if possible. Notice how for negative control, Neg and CTRL are in separate columns, same row, just like positive control where there is a blank and PCTRL in separate columns yet same rows. Any way to do this given these odd conditions?
Hope this was written clearly enough, and I thank anyone in advance for taking the time to help me!

Try this for subsetting your dataframe to those rows where R is not "Neg":
subset(test, R!="Neg")
For the NA problem, you probably already have NAs in your data frame, right? Try if this works:
a[a[,21] %in% 0, 5:10] <- NA

Try instead:
a[ which(a[,21]==0), 5:10] <-NA
Explanation: the == operation is returning NA values and the [<- function doesn't accept them. The which function will return a numeric vector and "throw away the NA's". As an aside, the [ function (without the '<-') will return all NA rows. This is considered a 'feature', but I find it to be an 'annoyance', so I will typically use which for selection as well as for selective-assignment.
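The difference between the three subscripts can be seen on a tiny made-up data frame (the column name flag stands in for column 21 and is an assumption):

```r
# Made-up data frame; column "flag" plays the role of column 21
a <- data.frame(flag = c(0, 1, NA, 0), x = 1:4, y = 5:8)

a$flag == 0            # TRUE FALSE NA TRUE  (the NA is what triggers the error)
which(a$flag == 0)     # 1 4
a$flag %in% 0          # TRUE FALSE FALSE TRUE

# Either of the NA-free forms is a valid subscript for assignment
a[which(a$flag == 0), c("x", "y")] <- NA
```

After the assignment, rows 1 and 4 are NA in x and y, while the row whose flag is NA is left untouched.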

For the first problem: if a[,21] is NA, do you want to assign NA as well? If so,
a[replace(a[,21],is.na(a[,21]),0)==0,5:10] <- NA
Otherwise (note that the replacement value is now something nonzero; 1 is used here, but any nonzero value works):
a[replace(a[,21],is.na(a[,21]),1)==0,5:10] <- NA
As for the second problem,
subset(test,! (R %in% c("Neg","") | T %in% c("CTRL","PCTRL")))
covers the case where the filtering conditions in R and T do not always coincide. If they always coincide, you can apply the test to just one of R or T. Also keep in mind that T is shorthand for TRUE in S, S-PLUS, and R; you can assign another value to T and things will still work, but it's generally discouraged (the same goes for c, which people also like to assign to).
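If the control rows need to stay in the data frame, one option is to compute a logical flag once and use it to restrict each calculation. The sketch below reuses the question's sample values but renames column T to Tm (an assumption, made to avoid masking the TRUE shorthand); the mean of L is only a placeholder for the real manipulation:

```r
# Same values as the question's sample data; Tm replaces the column name T.
# stringsAsFactors = FALSE keeps the text columns as character vectors.
test <- data.frame(L  = c(2, 1, 4, 3, 1, 4, 2, 4, 5, 1),
                   R  = c(2, 4, 5, 1, "Neg", 2, "", 1, 2, 1),
                   Tm = c(2, 1, 4, 2, "CTRL", 2, "PCTRL", 2, 1, 4),
                   stringsAsFactors = FALSE)

# Flag control rows by their labels rather than by position
is_control <- test$R %in% c("Neg", "") | test$Tm %in% c("CTRL", "PCTRL")
which(is_control)              # 5 7

# Placeholder manipulation: use only the non-control rows
mean(test$L[!is_control])      # 3

# The data frame itself is unchanged, so the controls are still available
nrow(test)                     # 10
```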
