How to sum numeric values in a character vector? - r

I have a data frame with rows containing several numbers in each row. The data type of each column in the data frame is factor.
My data frame looks like
> df
C1
1 1, 14, 1, 4
2 2
3 NA
4 3, 7, 5
Now I want to sum up the values of each element in df. Such that I get
Sum
1 20
2 2
3 NA
4 15
I tried strsplit(as.character(df$C1),split=","). However I have no idea how to build get the sum...
df <- data.frame(C1= c("1, 14, 1, 4", "2", NA, "3, 7, 5"))

Using your code:
sapply(strsplit(as.character(df$C1), split=","),
function(x) sum(as.numeric(x)))
#[1] 20 2 NA 15
Or
library(splitstackshape)
Sum1 <- rowSums(cSplit(df, 'C1', sep=','), na.rm=TRUE)
#Assuming that there is only one column
Sum1[!Sum1] <- NA
Sum1
#[1] 20 2 NA 15
Or may be this also
unname(sapply(gsub(",", "+", df$C1),
function(x) eval(parse(text=x))))
# [1] 20 2 NA 15

Converting your data to a data.frame and using rowSums:
rowSums(read.table(text=as.character(df$C1),sep=',',fill=TRUE),na.rm=TRUE)
[1] 20 2 0 15

A mostly for fun approach using eval parse:
unlist(lapply(paste0("c(", as.character(df$C1), ")"), function(x) sum(eval(parse(text=x)))))
## [1] 20 2 NA 15

Related

Perform calculations on row depending on individual cells [duplicate]

This question already has answers here:
Sum rows in data.frame or matrix
(7 answers)
Closed 2 years ago.
I have a data frame in R that looks like
1 3 NULL,
2 NULL 5,
NULL NULL 9
I want to iterate through each row and perform and add the two numbers that are present. If there aren't two numbers present I want to throw an error. How do I refer to specific rows and cells in R? To iterate through the rows I have a for loop. Sorry not sure how to format a matrix above.
for(i in 1:nrow(df))
Data:
df <- data.frame(
v1 = c(1, 2, NA),
v2 = c(3, NA, NA),
v3 = c(NA, 5, 9)
)
Use rowSums:
df$sum <- rowSums(df, na.rm = T)
Result:
df
v1 v2 v3 sum
1 1 3 NA 4
2 2 NA 5 7
3 NA NA 9 9
If you do need a for loop:
for(i in 1:nrow(df)){
df$sum[i] <- rowSums(df[i,], na.rm = T)
}
If you have something with NULL you can make it a data.frame, but that will make the columns with NULL a character vector. You have to convert those to numeric, which will then introduce NA for NULL.
rowSums will then create the sum you want.
df <- read.table(text=
"
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
", header =T)
# make columns numeric, this will change the NULL to NA
df <- data.frame(lapply(df, as.numeric))
cbind(df, sum=rowSums(df, na.rm = T))
# a b c sum
# 1 1 3 NA 4
# 2 2 NA 5 7
# 3 NA NA 9 9

Automate replacement of missing data on a sequence of variables using mutate_all

I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values on the variables with the ending "x1" with their corresponding values on the variables with the ending "x2". That is, if ax1 is missing it is replaced by ax2 and any missingness on bx1 is replaced by bx2 and so on. Since there are many variables than the scenario presented here, I am looking for a way to automate this process. I have tried the following codes
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace NA value in x1 with corresponding NA values in x2 using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
Modified data a bit making sure that NA are actual NAs and not string "NA" which were actually making columns as factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))

Excluding both the minimum and maximum value

I want to exclude the minimum as well as the maximum value of each row in a data frame. (If one of those value are repeated, only one should be excluded.)
I can exclude either the minimum, or the maximum, but not both.
I don't seem to find a way to combine those (which both work fine by themselves):
d[-which(d == min(d))[1]]
d[-which(d == max(d))[1]]
This doesn't work:
d[
-which(d == min(d))[1] &
-which(d == max(d))[1]
]
It gives the full row.
(I also tried an approach using apply(d, 1, min/max), but this also fails.)
Update
Remembered after looking at #Rich Pauloo's answer, we can directly use which.max and which.min to get index of minimum and maximum value
as.data.frame(t(apply(df, 1, function(x) x[-c(which.max(x), which.min(x))])))
# V1 V2 V3
#1 13 11 6
#2 15 8 18
#3 5 10 21
#4 14 12 17
#5 19 9 20
Here which.max/which.min will ensure that you get the index of first minimum and maximum respectively for each row.
Some other variations could be
as.data.frame(t(apply(df, 1, function(x)
x[-c(which.max(x == min(x)), which.max(x == max(x)))])))
If you want to use which we can do
as.data.frame(t(apply(df, 1, function(x)
x[-c(which(x == min(x)[1]), which(x == max(x)[1]))])))
data
set.seed(1234)
df <- as.data.frame(matrix(sample(25), 5, 5))
df
# V1 V2 V3 V4 V5
#1 3 13 11 16 6
#2 15 1 8 25 18
#3 24 5 4 10 21
#4 14 12 17 2 22
#5 19 9 20 7 23
You were very close! With data.frames you need to use a comma within the brackets to accomplish row-column subsetting.
Use which.max() and which.min() to return the index of the max and min values of a vector, respectively.
Bind those indices into a new vector with c().
Use - and the vector from 2. to subset your data frame for the desired rows.
Here's an example to copy/paste:
d <- data.frame(a = 1:5) # make example data.frame
d[-c(which.max(d$a), which.min(d$a)), ]
[1] 2 3 4
This will remove the rows containing the min and max values of score as shown in the example data frame.
library(tidyverse)
df <- tribble(~name, ~score,
'John', 10,
'Mike', 2,
'Mary', 11,
'Jane', 1,
'Jill', 5)
df %>%
arrange(score) %>%
slice(-1, -nrow(.))
# A tibble: 3 x 2
name score
<chr> <dbl>
1 Mike 2
2 Jill 5
3 John 10
We can use
t(apply(df, 1, function(x) x[!x %in% range(x)]))

Is there a way to recode an SPSS function in R to create a single new variable?

Can somebody please help me with a recode from SPSS into R?
SPSS code:
RECODE variable1
(1,2=1)
(3 THRU 8 =2)
(9, 10 =3)
(ELSE = SYSMIS)
INTO variable2
I can create new variables with the different values. However, I'd like it to be in the same variable, as SPSS does.
Many thanks.
x <- y<- 1:20
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y[x %in% (1:2)] <- 1
y[x %in% (3:8)] <- 2
y[x %in% (9:10)] <- 3
y[!(x %in% (1:10))] <- NA
y
[1] 1 1 2 2 2 2 2 2 3 3 NA NA NA NA NA NA NA NA NA NA
I wrote a function that has very similiar coding to the spss code recode. See here
variable1 <- -1:11
recodeR(variable1, c(1, 2, 1), c(3:8, 2), c(9, 10, 3), else_do= "missing")
NA NA 1 1 2 2 2 2 2 2 3 3 NA
This function now works also for other examples. This is how the function is defined
recodeR <- function(vec_in, ..., else_do){
l <- list(...)
# extract the "from" values
from_vec <- unlist(lapply(l, function(x) x[1:(length(x)-1)]))
# extract the "to" values
to_vec <- unlist(lapply(l, function(x) rep(x[length(x)], length(x)-1)))
# plyr is required for mapvalues
require(plyr)
# recode the variable
vec_out <- mapvalues(vec_in, from_vec, to_vec)
# if "missing" is written then all outside the defined range will be missings.
# Otherwise values outside the defined range stay the same
if(else_do == "missing"){
vec_out <- ifelse(vec_in < min(from_vec, na.rm=T) | vec_in > max(from_vec, na.rm=T), NA, vec_out)
}
# return resulting vector
return(vec_out)}

Calculating moving differences across columns per row in r

I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between two numbers in column 1 and 2, then columns 3 and 4, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advice how to make this code shorter and/or any other packages/functions to check out which will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but long way, so would like to simplify code and make generic for any length of observations in my dataframe and any number of columns (i.e. more than 3). I would like to "automate" the steps below, if know number of observations (i.e. rows).
row1=diff(as.numeric(test[1,], lag=1))
row2=diff(as.numeric(test[2,], lag=1))
row3=diff(as.numeric(test[3,], lag=1))
row4=diff(as.numeric(test[4,], lag=1))
row5=diff(as.numeric(test[5,], lag=1))
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
For any of them to set the names:
colnames(d) <- paste0("diff.", head(names(test.data), -1), colnames(d))
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
An apply based solution using diff on row-wise can be achieved as:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40

Resources