Convert list df (with multiple columns) to numeric - r

I have a df below:
view(fds)
#1 #2 #3 #4
1# 1 3 4 2
2# 4 5 3 2
3# 2 5 3 1
4# 3 5 1 3
I want to run fds.sum <- rowSums(fds), but I get "Error in rowSums(fds) : 'x' must be numeric". When I then try fds.num <- as.numeric(fds), I get "Error: 'list' object cannot be coerced to type 'double'".
I have tried fds.num <- lapply(fds, as.numeric) but that gives me:
fds.num list[4] List of Length 4
1# double[101] 1 4 2 3
2# double[101] 3 5 5 5
3# double[101] 4 3 3 1
4# double[101] 2 2 1 3
I just want a sum of my rows in a new column such that:
#1 #2 #3 #4 sum
1# 1 3 4 2 10
2# 4 5 3 2 14
3# 2 5 3 1 11
4# 3 5 1 3 12
Anyone know how to do that?

If we want to use the OP's code, just Reduce with +
fds$sum <- Reduce(`+`, lapply(fds, as.numeric) )
Or after converting to numeric, bind them as a matrix or update the original data
fds[] <- lapply(fds, as.numeric)
fds$sum <- rowSums(fds, na.rm = TRUE)
Or it can be done on the fly with sapply
fds$sum <- rowSums(sapply(fds, as.numeric))
Or even without doing as.numeric, can be automated with type.convert
fds$sum <- rowSums(type.convert(fds, as.is = TRUE))
The error shown in the OP's code is a result of applying rowSums directly to a list, as lapply always returns a list and rowSums expects a numeric matrix or data frame.
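To make the difference concrete, here is a minimal sketch. It assumes the columns of fds were read in as character strings; the question does not show how fds was created, so that is a guess, and the column names a to d are made up. It shows that lapply alone returns a plain list, while assigning into fds[] keeps the data frame.

```r
# Assumed reconstruction of the OP's data: digits stored as character columns
fds <- data.frame(
  a = c("1", "4", "2", "3"),
  b = c("3", "5", "5", "5"),
  c = c("4", "3", "3", "1"),
  d = c("2", "2", "1", "3"),
  stringsAsFactors = FALSE
)

as_list <- lapply(fds, as.numeric)   # a plain list, which rowSums() rejects
is.data.frame(as_list)               # FALSE

fds[] <- lapply(fds, as.numeric)     # fds[] keeps the data.frame structure
fds$sum <- rowSums(fds)
fds$sum                              # 10 14 11 12
```

If the columns were factors rather than character strings, as.numeric would return the level codes, so as.numeric(as.character(x)) would be the safer conversion.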


How to convert factor column in df to numeric strings per row?

I am using R for a research project that requires me to input a sequence of 1-5 of varying length and then calculate a score from that sequence.
The data frame I have stores the sequences as a factor. If I take a single entry and convert it to a numeric vector, I can input it into the formula. But if I try to do this for all rows I run into errors.
I have searched SO and other sources but only found information on how to convert factors to numeric if they contain one value per cell. My data contains a sequence of numbers per cell separated by commas.
If I take the input from one cell and apply as.numeric to the result of strsplit(as.character(...)), it works. But I don't want to process every cell manually. How can I solve this?
This is what I did:
df <- read.csv2("example_seq_logs.csv", na.strings = "n/a")
df$seqtext <- as.character(df$hmm)
This is what the data frame looks like:
head(df)
lesson hmm
1 A 1,2,3,3,3,4,3,4,5,4,4,5,5,2,2,1,2,3,4,2,3
2 B 2,2,3,4,1,1,3,3,3,5,5,4,4,4,2,1
3 C 1,3,1,3,2,3,2,2,3,3,4,1,3,2,3,3,5,4,4,3,3
4 D 1,3,2,2,3,3,2,3,1,4,4,5,5,2,4,4,4,3
5 E 1,4,2,5,1,3,1,3,1,4,3,4,4
str(df)
'data.frame': 5 obs. of 2 variables:
$ lesson: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ hmm : Factor w/ 5 levels "1,2,3,3,3,4,3,4,5,4,4,5,5,2,2,1,2,3,4,2,3",..: 1 5 2 3 4
sapply(df, mode)
lesson hmm
"numeric" "numeric"
Now if I take a single entry I can do this:
testseq <- as.numeric(strsplit(df$seqtext[1], ",")[[1]])
str(testseq)
num [1:21] 1 2 3 3 3 4 3 4 5 4 ...
and then I can input the testseq sequence into the function I need.
But when I try the same for the whole column it results in an error
df$seq <- as.numeric(strsplit(df$seqtext, ","))[[1:58]]
Error: (list) object cannot be coerced to type 'double'
Thank you for your help!
Edit:
The first suggestion yields this error:
df$seq <- as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
Error in `$<-.data.frame`(`*tmp*`, seq, value = c(1, 2, 3, 3, 3, 4, 3, :
replacement has 89 rows, data has 5
It seems it turns the entire column into one long string.
a <- as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
print(a)
[1] 1 2 3 3 3 4 3 4 5 4 4 5 5 2 2 1 2 3 4 2 3 2 2 3 4 1 1 3 3 3 5 5 4 4 4 2 1 1 3 1 3 2 3 2 2 3 3 4 1 3 2 3
[53] 3 5 4 4 3 3 1 3 2 2 3 3 2 3 1 4 4 5 5 2 4 4 4 3 1 4 2 5 1 3 1 3 1 4 3 4 4
But I need each sequence to turn up in the right row as a string.
Edit:
I found that the function I need to calculate results with doesn't need numerics so now I've solved the issue using a for loop:
df$score <- 0
for (i in 1:nrow(df)) {
seq <- as.array(strsplit(as.character(df$hmm),","))
session_seq <- seq[i]
res = computehmm(session_seq)
df$score[i] <- res$score
}
But now it stops calculating once it reaches an empty df$hmm field.
I understand sapply would be better but I don't understand how to get it to work.
You can use paste as:
as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
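Because paste(..., collapse = ",") merges every row into one long vector, the result cannot be assigned back row by row. One way to keep each sequence in its own row is a list column. The sketch below assumes a small made-up data frame (not the OP's CSV file) and uses sum() as a stand-in for the OP's computehmm(), which is not shown in the question.

```r
# Hypothetical stand-in for the OP's data; one row has a missing sequence
df <- data.frame(
  lesson = c("A", "B", "C"),
  hmm    = c("1,2,3", NA, "4,5"),
  stringsAsFactors = FALSE
)

# strsplit() returns one character vector per row; convert each to numeric
df$seq <- lapply(strsplit(as.character(df$hmm), ","), as.numeric)

# Apply a per-row function; sum() stands in for the OP's scoring function.
# Rows without a sequence get NA instead of stopping the computation.
df$score <- sapply(df$seq, function(s) if (anyNA(s)) NA_real_ else sum(s))
df$score   # 6 NA 9
```

The explicit anyNA() check is one possible way to skip missing fields; how empty fields should be scored depends on the OP's function.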

Recoding specific column values using reference list

My dataframe looks like this
data = data.frame(ID=c(1,2,3,4,5,6,7,8,9,10),
Gender=c('Male','Female','Female','Female','Male','Female','Male','Male','Female','Female'))
And I have a reference list that looks like this -
ref=list(Male=1,Female=2)
I'd like to replace values in the Gender column using this reference list, without adding a new column to my dataframe.
Here's my attempt
do.call(dplyr::recode, c(list(data), ref))
Which gives me the following error -
no applicable method for 'recode' applied to an object of class
"data.frame"
Any inputs would be greatly appreciated
One option is to do a left_join after stacking the 'ref' list into a two-column data.frame:
library(dplyr)
left_join(data, stack(ref), by = c('Gender' = 'ind')) %>%
select(ID, Gender = values)
A base R approach would be
unname(unlist(ref)[as.character(data$Gender)])
#[1] 1 2 2 2 1 2 1 1 2 2
In base R:
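# as.character() guards against Gender being a factor, which would index ref by level code
data$Gender = sapply(data$Gender, function(x) ref[[as.character(x)]])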
You can use factor, i.e.
factor(data$Gender, levels = names(ref), labels = ref)
#[1] 1 2 2 2 1 2 1 1 2 2
You can unlist ref to give you a named vector of codes, and then index this with your data:
transform(data,Gender=unlist(ref)[as.character(Gender)])
ID Gender
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
7 7 1
8 8 1
9 9 2
10 10 2
Surprisingly, that one works as well:
data$Gender <- ref[as.character(data$Gender)]
#> data
# ID Gender
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 2
# 5 5 1
# 6 6 2
# 7 7 1
# 8 8 1
# 9 9 2
# 10 10 2
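One caveat about the last variant: indexing the list ref with [ returns a list, so Gender becomes a list column even though it prints like a numeric one. A short sketch of the difference, using a shortened version of the data from the question (the names as_list and as_vec are only for illustration):

```r
data <- data.frame(ID = 1:4,
                   Gender = c("Male", "Female", "Female", "Male"),
                   stringsAsFactors = FALSE)
ref <- list(Male = 1, Female = 2)

as_list <- data
as_list$Gender <- ref[as.character(as_list$Gender)]
is.list(as_list$Gender)        # TRUE: a list column

as_vec <- data
as_vec$Gender <- unname(unlist(ref)[as.character(as_vec$Gender)])
is.numeric(as_vec$Gender)      # TRUE: an ordinary numeric column
as_vec$Gender                  # 1 2 2 1
```

A list column can trip up later steps such as mean() or write.csv(), so the unlist() form is probably the safer default.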

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector that has fewer elements than there are rows in the dataframe.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be recycled evenly into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, it does not work everywhere. In particular, you have probably used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)
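If the goal is simply to fill a column of any data frame by repeating a shorter vector, the target length can be taken from nrow() instead of being hard-coded. A minimal sketch, where fill_down is a hypothetical helper name and not a base R function:

```r
# Hypothetical helper: repeat `values` until it covers every row of `df`
fill_down <- function(df, name, values) {
  df[[name]] <- rep_len(values, nrow(df))
  df
}

df <- data.frame(x = 1:5)
df <- fill_down(df, "z", 1:4)
df$z   # 1 2 3 4 1
```

Unlike the implicit recycling in cbind, rep_len truncates or repeats silently, so no warning is raised when the lengths do not divide evenly.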

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code that keeps only the data within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we need to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before. Instead of returning 'x', we return the positions (seq_along) of the elements of 'x' that are not outliers, and then take the intersection of the list elements with Reduce. The resulting numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first indicates the numeric columns, and the second checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function:
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
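The same filter can be wrapped in a small function so that the cut-off becomes a parameter. This is a sketch under the assumption that the numeric columns contain no missing values; keep_within_sd is a made-up name, not an existing function.

```r
# Hypothetical helper: keep rows whose numeric columns all lie within `k` SD
keep_within_sd <- function(df, k = 2) {
  num <- vapply(df, is.numeric, logical(1))   # which columns are numeric
  z <- abs(scale(df[num]))                    # column-wise absolute z-scores
  df[rowSums(z > k) == 0, , drop = FALSE]     # rows with no value beyond k SD
}

df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

res <- keep_within_sd(df)
rownames(res)   # "2" "3" "4" "5" "6" "7" "8"
```

Only the first row is dropped, because its birds value (21) lies about 2.2 SD above the column mean.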

Identifying unique duplicates in vector in R

I am trying to identify duplicates based on a match of elements in two vectors. Using duplicated() provides a vector of all matches, but I would like to index which elements match each other. Using the following code as an example:
x <- c(1,6,4,6,4,4)
y <- c(3,2,5,2,5,5)
frame <- data.frame(x,y)
matches <- duplicated(frame) | duplicated(frame, fromLast = TRUE)
matches
[1] FALSE TRUE TRUE TRUE TRUE TRUE
Ultimately, I would like to create a vector that identifies that elements 2 and 4 match each other, as do elements 3, 5 and 6. Any thoughts are greatly appreciated.
Another data.table answer, using the group counter .GRP to assign every distinct element a label:
library(data.table)
d <- data.table(frame)
d[,z := .GRP, by = list(x,y)]
# x y z
# 1: 1 3 1
# 2: 6 2 2
# 3: 4 5 3
# 4: 6 2 2
# 5: 4 5 3
# 6: 4 5 3
How about this with plyr::ddply()?
library(plyr)
ddply(cbind(index=1:nrow(frame),frame),.(x,y),summarise,count=length(index),elems=paste0(index,collapse=","))
x y count elems
1 1 3 1 1
2 4 5 3 3,5,6
3 6 2 2 2,4
NB: the expression cbind(index=1:nrow(frame),frame) just adds a row index column to the data frame
Using merge against the unique possibilities for each row, you can get a result:
labls <- data.frame(unique(frame),num=1:nrow(unique(frame)))
result <- merge(transform(frame,row = 1:nrow(frame)),labls,by=c("x","y"))
result[order(result$row),]
# x y row num
#1 1 3 1 1
#5 6 2 2 2
#2 4 5 3 3
#6 6 2 4 2
#3 4 5 5 3
#4 4 5 6 3
The result$num vector gives the groups.
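A base R sketch without any packages is also possible: paste the two columns into a key and use match() against the unique keys to number the groups in order of first appearance. The separator passed to paste() is an arbitrary choice and must not occur in the data; the names key and grp are only for illustration.

```r
x <- c(1, 6, 4, 6, 4, 4)
y <- c(3, 2, 5, 2, 5, 5)

key <- paste(x, y, sep = "_")     # one key per (x, y) pair
grp <- match(key, unique(key))    # group id in order of first appearance
grp                               # 1 2 3 2 3 3

# Positions belonging to each group: 1 / 2 4 / 3 5 6
members <- split(seq_along(grp), grp)
members
```

Groups with more than one member (here groups 2 and 3) are the sets of elements that duplicate each other.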
