R sum multiple columns with multiple row - r

So i have this data
10 21 22 23 23 43
20 12 26 43 23 65
21 54 64 73 25 75
My expected outcome is:
142
189
312
I tried to use:
df = data.matrix(df)
df = colSums(df)
df = as.data.frame(df)
However, the sum of values are wrong. I would like to know how to improve or correct this solution?

We can use rowSums
rowSums(df)
#[1] 142 189 312

Your data is stored as factors. You must convert it to numeric using as.numeric(as.character()).
In your situation I suggest to do:
for(i in 1:nrow(df)){
df[i,]<-as.numeric(as.character(df[i,]))
}
rowSums(df)

Related

Select a range of rows from every n rows from a data frame

I have 2880 observations in my data.frame. I have to create a new data.frame in which, I have to select rows from 25-77 from every 96 selected rows.
df.new = df[seq(25, nrow(df), 77), ] # extract from 25 to 77
The above code extracts only row number 25 to 77 but I want every row from 25 to 77 in every 96 rows.
One option is to create a vector of indeces with which subset the dataframe.
idx <- rep(25:77, times = nrow(df)/96) + 96*rep(0:29, each = 77-25+1)
df[idx, ]
You can use recycling technique to extract these rows :
from = 25
to = 77
n = 96
df.new <- df[rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to))), ]
To explain for this example it will work as :
length(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19))) #returns
#[1] 96
In these 96 values, value 25-77 are TRUE and rest of them are FALSE which we can verify by :
which(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19)))
# [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#[23] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#[45] 69 70 71 72 73 74 75 76 77
Now this vector is recycled for all the remaining rows in the dataframe.
First, define a Group variable, with values 1 to 30, each value repeating 96 times. Then define RowWithinGroup and filter as required. Finally, undo the changes introduced to do the filtering.
df <- tibble(X=rnorm(2880)) %>%
add_column(Group=rep(1:96, each=30)) %>%
group_by(Group) %>%
mutate(RowWithinGroup=row_number()) %>%
filter(RowWithinGroup >= 25 & RowWithinGroup <= 77) %>%
select(-Group, -RowWithinGroup) %>%
ungroup()
Welcome to SO. This question may not have been asked in this exact form before, but the proinciples required have been rerefenced in many, many questions and answers,
A one-liner base solution.
lapply(split(df, cut(1:nrow(df), nrow(df)/96, F)), `[`, 25:77, )
Note: Nothing after the last comma
The code above returns a list. To combine all data together, just pass the result above into
do.call(rbind, ...)

R: many nested loops to remove rows in multiple data frames

I have 18 data frames called regular55, regular56, regular57, collar55, collar56, etc. In each data frame, I want to delete the first row of each nest.
Each data frame looks like this:
nest interval
1 17 -8005
2 17 183
3 17 186
4 17 221
5 17 141
6 17 30
7 17 158
8 17 23
9 17 199
10 17 51
11 17 169
12 17 176
13 31 905
14 31 478
15 31 40
16 31 488
17 31 16
18 31 203
19 31 54
20 31 341
21 31 54
22 50 -14164
23 50 98
24 50 1438
25 71 240
26 71 725
27 71 819
28 85 -13935
29 85 45
30 85 589
31 85 47
32 85 161
33 85 67
The solution I came up with to avoid writing out the function for each one of the 18 data frames includes many nested loops:
for (i in 5:7){
for (j in 5:7) {
for (k in c("regular","collar")){
for (l in c(unique(paste0(k,i,j,"$nest")))){
paste0(k,i,j)=paste0(k,i,j)[(-c(which((paste0(k,i,j,"$nest")) == l )
[1])),]
}}}}
I'm basically selecting the first value at "which" there is a "unique" value of nest. However, I get:
Error in paste0(k, i, j)[(-c(which((paste0(k, i, j, "$nest")) == l)[1])), :
incorrect number of dimensions
It might be because "paste0(k,i,j)" is only considered as a character and not recognized as the name for a data frame.
Any ideas on how to fix this? Or any other ways to delete the first rows for each nest in every data frame?
Thanks to help from the comments, my problem was solved.
Originally, I divided my data frame using a for loop and then grouped it into one list:
for (i in 5:7) {
for (j in 5:7) {
for (k in c("regular","collar")){
assign(paste0(k,i,j),
df[df$x == i & df$y == j & df$z == k,])
}}}
df.list=mget(ls(pattern=("[regular,collar][5-7][5-7]")))
I later found a way to split my data frame directly into a list based on multiple columns (R subsetting a data frame into multiple data frames based on multiple column values):
df.list= split(df, with(df, interaction(df$x, df$y, df$z)), drop = TRUE)
Finally, I was able to apply the function to remove the first rows of each nest:
df.list.updated = lapply(df.list, function(d) d %>% group_by(nest) %>%
slice(2:n()))
It is definitely easier to work from a list of data frames.

use dplyr mutate() in programming

I am trying to assign a column name to a variable using mutate.
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
df%>%mutate(name = ifelse(x <50, "small", "big"))
}
When I run
new(name = "newVar")
it doesn't work. I know mutate_() could help but I'm struggling in using it together with ifelse.
Any help would be appreciated.
Using dplyr 0.7.1 and its advances in NSE, you have to UQ the argument to mutate and then use := when assigning. There is lots of info on programming with dplyr and NSE here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I've changed the name of the function argument to myvar to avoid confusion. You could also use case_when from dplyr instead of ifelse if you have more categories to recode.
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(myvar){
df %>% mutate(UQ(myvar) := ifelse(x < 50, "small", "big"))
}
new(myvar = "newVar")
This returns
x y newVar
1 37 1.82669 small
2 63 -0.04333 big
3 46 0.20748 small
4 93 0.94169 big
5 83 -0.15678 big
6 14 -1.43567 small
7 61 0.35173 big
8 26 -0.71826 small
9 21 1.09237 small
10 90 1.99185 big
11 60 -1.01408 big
12 70 0.87534 big
13 55 0.85325 big
14 38 1.70972 small
15 6 0.74836 small
16 23 -0.08528 small
17 27 2.02613 small
18 76 -0.45648 big
19 97 1.20124 big
20 99 -0.34930 big
21 74 1.77341 big
22 72 -0.32862 big
23 64 -0.07994 big
24 53 -0.40116 big
25 16 -0.70226 small
26 8 0.78965 small
27 34 0.01871 small
28 24 1.95154 small
29 82 -0.70616 big
30 77 -0.40387 big
31 43 -0.88383 small
32 88 -0.21862 big
33 45 0.53409 small
34 29 -2.29234 small
35 54 1.00730 big
36 22 -0.62636 small
37 100 0.75193 big
38 52 -0.41389 big
39 36 0.19817 small
40 89 -0.49224 big
41 81 -1.51998 big
42 18 0.57047 small
43 78 -0.44445 big
44 49 -0.08845 small
45 20 0.14014 small
46 32 0.48094 small
47 1 -0.12224 small
48 66 0.48769 big
49 11 -0.49005 small
50 87 -0.25517 big
Following the dlyr programming vignette, define your function as follows:
new <- function(name)
{
nn <- enquo(name) %>% quo_name()
df %>% mutate( !!nn := ifelse(x <50, "small", "big"))
}
enquo takes its expression argument and quotes it, followed by quo_name converting it into a string. Since nn is now quoted, we need to tell mutate not to quote it a second time. That's what !! is for. Finally, := is a helper operator to make it valid R code. Note that with this definition, you can simply pass newVar instead of "newVar" to your function, maintaining dplyr style.
> new( newVar ) %>% head
x y newVar
1 94 -1.07642088 big
2 85 0.68746266 big
3 80 0.02630903 big
4 74 0.18323506 big
5 86 0.85086915 big
6 38 0.41882858 small
Base R solution
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
df[,name]='s'
df[,name][df$x>50]='b'
return(df)
}
I am using dplyr 0.5 so i just combine base R with mutate
new <- function(Name){
df=mutate(df,ifelse(x <50, "small", "big"))
names(df)[3]=Name
return(df)
}
new("newVar")

Cumulative count of values in R

I hope you are doing very well. I would like to know how to calculate the cumulative sum of a data set with certain conditions. A simplified version of my data set would look like:
t id
A 22
A 22
R 22
A 41
A 98
A 98
A 98
R 98
A 46
A 46
R 46
A 46
A 46
A 46
R 46
A 46
A 12
R 54
A 66
R 13
A 13
A 13
A 13
A 13
R 13
A 13
Would like to make a new data set where, for each value of "id", I would have the cumulative number of times that each id appears , but when t=R I need to restart the counting e.g.
t id count
A 22 1
A 22 2
R 22 0
A 41 1
A 98 1
A 98 2
A 98 3
R 98 0
A 46 1
A 46 2
R 46 0
A 46 1
A 46 2
A 46 3
R 46 0
A 46 1
A 12 1
R 54 0
A 66 1
R 13 0
A 13 1
A 13 2
A 13 3
A 13 4
R 13 0
A 13 1
Any ideas as to how to do this? Thanks in advance.
Using rle:
out <- transform(df, count = sequence(rle(do.call(paste, df))$lengths))
out$count[out$t == "R"] <- 0
If your data.frame has more than these two columns, and you want to check only these two columns, then, just replace df with df[, 1:2] (or) df[, c("t", "id")].
If you find do.call(paste, df) dangerous (as #flodel comments), then you can replace that with:
as.character(interaction(df))
I personally don't find anything dangerous or clumsy with this setup (as long as you have the right separator, meaning you know your data well). However, if you do find it as such, the second solution may help you.
Update:
For those who don't like using do.call(paste, df) or as.character(interaction(df)) (please see the comment exchanges between me, #flodel and #HongOoi), here's another base solution:
idx <- which(df$t == "R")
ww <- NULL
if (length(idx) > 0) {
ww <- c(min(idx), diff(idx), nrow(df)-max(idx))
df <- transform(df, count = ave(id, rep(seq_along(ww), ww),
FUN=function(y) sequence(rle(y)$lengths)))
df$count[idx] <- 0
} else {
df$count <- seq_len(nrow(df))
}

How to return back the imputed values in R

Is there any function in R that can help to return imputed values, for example:
x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,
23)
by using single linear imputation method,
na.approx(x)
I get the imputed data as;
[1] 23 23 25 43 34 22 78 35 98 23 30 24 21 78 22 76 22 77 33 98 22 14 52 87 59
[26] 23 23
How can I get the imputed value from the program back without looking at the completed dataset one by one? For example, if the data I imputed contain $n=200$ observations, can I get 20 estimates of the missing value?
I am not 100 percent sure if I got you right, but does this help?
You first save the places, at which the original NA values are, so.e.g the first NA value is at the 8th place. Save this into the dummy variable
dummy<-NA
for (i in 1:length(x)){
if(is.na(x[i])) dummy[i]<-i
}
Now get the corresponding values in the imputed data
imputeddata<-na.approx(x)
for (i in 1:length(imputeddata)){
if(!is.na(imputeddata[dummy[i]])) print(imputeddata[dummy[i]])
}
You could use is.na to select only those values that were previously NA.
> x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
> na.approx(x)[is.na(x)]
[1] 88.0 25.5 76.5 37.0 55.0
Hope that helps.

Resources