I have a list of data frames, each with 6 rows and 6 columns. All the values are numbers, but every element is stored as class character.
Example:
$`A`
V1 V2 V3 V4 V5 V6
V1 0.1212 0.6231 0.4431 0.3213 0.6578 0.1259
V2 2.1234 0.6532 0.9845 0.8743 0.8732
V3 0.2314 0.7648 0.7634 0.8732
V4 0.1234 0.6544 0.3456
V5 0.7653 0.9812
V6 0.1265
$`B`
V1 V2 V3 V4 V5 V6
V1 0.2345 0.1234 0.5647 0.7891 0.6721 0.3259
V2 1.1334 0.4332 0.1245 0.2343 0.5332
V3 0.2914 0.1648 0.2334 0.1232
V4 0.1234 0.6744 0.5656
V5 0.3553 0.9812
V6 0.4665
I would like to change all data frames of the list to class matrix (numerical).
I tried:
lapply (list, data.matrix)
but the result is a list of data frames with integers. Example:
V1 V2 V3 V4 V5 V6
V1 2 2 2 2 2 4
V2 1 3 4 5 5 7
V3 1 1 3 4 6 3
V4 1 1 1 3 4 5
V5 1 1 1 1 1 1
V6 1 1 1 1 1 1
Also tried to run
lapply(list, as.matrix)
however, I got a list of quoted matrices, like this:
$`A`
V1 V2 V3 V4 V5 V6
V1 "0.1212" "0.6231" "0.4431" "0.3213" "0.6578" "0.1259"
V2 "2.1234" "0.6532" "0.9845" "0.8743" "0.8732"
V3 "0.2314" "0.7648" "0.7634" "0.8732"
V4 "0.1234" "0.6544" "0.3456"
V5 "0.7653" "0.9812"
V6 "0.1265"
How can I convert these data frames of my list from character class to matrix class?
We can loop over the list with lapply, convert each data frame's columns to numeric with a second lapply, assign the result back with x[] <- so the data frame keeps its structure, and return the data frame ('x'):
list <- lapply(list, function(x) {x[] <- lapply(x, as.numeric);x})
If those are factor columns, convert to character first and then to numeric:
lapply(list, function(x) {
  x[] <- lapply(x, function(y) as.numeric(as.character(y)))
  x
})
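The column-wise conversion above returns data frames, while the question asks for numeric matrices. data.matrix() produced integers because it turns character columns into factors and returns their internal codes. A minimal sketch of one way to finish the job, assuming character (not factor) columns: the input list dfs_chr and the helper to_numeric_matrix() are hypothetical names used only for illustration.

```r
# Hypothetical input: a named list of data frames whose columns are character.
# Empty strings stand in for the blank cells of the triangular tables.
dfs_chr <- list(
  A = data.frame(V1 = c("0.1212", ""), V2 = c("0.6231", "2.1234"),
                 row.names = c("V1", "V2"), stringsAsFactors = FALSE),
  B = data.frame(V1 = c("0.2345", ""), V2 = c("0.1234", "1.1334"),
                 row.names = c("V1", "V2"), stringsAsFactors = FALSE)
)

# Hypothetical helper: character data frame -> numeric matrix
to_numeric_matrix <- function(x) {
  m <- as.matrix(x)      # character matrix; dim and dimnames are kept
  mode(m) <- "numeric"   # coerce the values; blank strings become NA
  m
}

mats <- lapply(dfs_chr, to_numeric_matrix)
str(mats$A)
```

Each element of mats is then a numeric matrix with the original row and column names, and the blank cells show up as NA.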
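If the list elements are character matrices (as in the Data below), you can convert to numeric and then restore the matrix dimensions, which as.numeric() drops: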
lapply(dfs, function(x) matrix(as.numeric(x), ncol = n_cols))
Data
set.seed(1L)
n_cols <- 6
n_total <- 36
a <- matrix(rnorm(n_total), ncol = n_cols)
b <- matrix(rnorm(n_total), ncol = n_cols)
a[lower.tri(a)] <- ""
b[lower.tri(b)] <- ""
dfs <- list(a, b)
I have a very large dataframe of 63 columns and 1697 rows. The end of the rows fill up with NAs but I want the matching values in rows to be in the same column, and stick the NAs into the gaps
a bit like this (updated):
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2","NA","v2","v2","v2")
v3 <- c("v4","v4","v3","NA","v3","v3", "v3")
v4 <- c("v5","v5","v4","NA","v5","v4","NA")
v5 <- c("NA","NA","v5","NA","v6","v6", "NA")
v6 <- c("NA","NA","v6","NA","v7","v7","NA")
v7 <- c("NA","NA","NA","NA","NA","NA","NA")
df <- data.frame(v1,v2,v3,v4,v5,v6,v7)
df
v1 v2 v3 v4 v5 v6 v7
1 v1 v3 v4 v5 NA NA NA
2 v1 v2 v4 v5 NA NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 v5 v6 v7 NA
6 v1 v2 v3 v4 v6 v7 NA
7 v1 v2 v3 NA NA NA NA
but I would like everything aligned like this:
v1 v2 v3 v4 v5 v6 v7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
I have tried map.values(), which didn't come out as expected, as well as ifelse(), but both require me to enter specific cell data and change it.
The column names do match the cell names.
I want to use the data to put into a presence absence plot, so I figured after I can just
for (i in 1:63){
df[,i] <- gsub("NA", 0, df[,i])
}
and then same for anything containing "v" to have a binary 1 or 0 for presence or absence, but they have to be aligned
There are no predefined rules governing the data, the dataframe has been conglomerated together from many other .csv files and this is the best format I can get it into currently.
Any help would be appreciated!
Updated answer to match new input data
Data
I removed the quotation marks from NA:
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2",NA,"v2","v2","v2")
v3 <- c("v4","v4","v3",NA,"v3","v3", "v3")
v4 <- c("v5","v5","v4",NA,"v5","v4",NA)
v5 <- c(NA,NA,"v5",NA,"v6","v6", NA)
v6 <- c(NA,NA,"v6",NA,"v7","v7",NA)
v7 <- c(NA,NA,NA,NA,NA,NA,NA)
df <- data.frame(v1,v2,v3,v4,v5,v6,v7, stringsAsFactors = FALSE)
Code
l <- list()
u <- c("v1", "v2", "v3", "v4", "v5", "v6", "v7")
h <- NULL
for (k in 1:nrow(df)) {
  # create a list for each row of the df
  l[[k]] <- df[k, ]
  for (i in 1:length(l[[k]])) {
    # check if the value exists in the row
    if (u[i] %in% l[[k]]) {
      # find the index of the value given it exists
      a <- which(l[[k]] == u[i])
      # assign to "help" vector in order to not overwrite values
      h[i] <- l[[k]][a]
    } else {
      # values that do not exist in the row are assigned NA
      h[i] <- NA
    }
  }
  # replace row by sorted vector with NA placeholders ("help" vector)
  l[[k]] <- h
}
Result
df1 <- as.data.frame(do.call(rbind, l))
df1
V1 V2 V3 V4 V5 V6 V7
1 v1 NA v3 v4 v5 NA NA
2 v1 v2 NA v4 v5 NA NA
3 v1 v2 v3 v4 v5 v6 NA
4 v1 NA NA NA NA NA NA
5 v1 v2 v3 NA v5 v6 v7
6 v1 v2 v3 v4 NA v6 v7
7 v1 v2 v3 NA NA NA NA
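The same alignment can probably be written without explicit loops. The following is only a sketch under the assumption stated in the question that the cell values match the column names; aligned and presence are hypothetical names. It also derives the 0/1 presence/absence table the question was ultimately after.

```r
# Same data as in the answer above
v1 <- c("v1","v1","v1","v1","v1","v1","v1")
v2 <- c("v3","v2","v2",NA,"v2","v2","v2")
v3 <- c("v4","v4","v3",NA,"v3","v3","v3")
v4 <- c("v5","v5","v4",NA,"v5","v4",NA)
v5 <- c(NA,NA,"v5",NA,"v6","v6",NA)
v6 <- c(NA,NA,"v6",NA,"v7","v7",NA)
v7 <- c(NA,NA,NA,NA,NA,NA,NA)
df <- data.frame(v1,v2,v3,v4,v5,v6,v7, stringsAsFactors = FALSE)

u <- names(df)  # assumption: cell values are drawn from the column names

# For each row, keep u[i] in position i if it occurs in the row, else NA.
# apply() returns one column per row of df, hence the transpose.
aligned <- t(apply(df, 1, function(r) ifelse(u %in% r, u, NA)))
colnames(aligned) <- u

# Presence/absence as 1/0
presence <- +(!is.na(aligned))
presence
```

Working on presence directly avoids the later gsub() step, since the 1/0 coding falls out of is.na().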
I am required to build a function which uses mean to replace missing values for continuous/integer variables and uses mode to replace missing values for categorical variables.
The data comes from credit screening dataset
X <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data", header = FALSE, na.strings = '?')
The first column of the dataset is of factor type, second and third columns are numeric.....
I built a mode function
mode_function <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Which works as intended.
The overall function that I am using on the dataset is
broken <- function(data){
for(i in 1:ncol(data)){
if(is.factor(data[,i])){
data[is.na(data[,i]),i] <- mode_function(data[,i])
}
else{
data[is.na(data[,i]),i] <- mean(data[,i], na.rm = TRUE)
}
}
return(data)
}
Problem: I run this function and nothing changes in my dataset. I still have the same number of missing values as I did before the function was run.
This line outside of the function works just as intended. The same with the code that deals with mean.
data[is.na(data[,i]),i] <- mode_function(data[,i])
But once I try to use my function to perform the exact same operations nothing happens.
The most likely reason for "nothing happening" is failing to assign the result to an R name/symbol: a function works on a copy of its argument, so the original data frame is never modified. Perhaps try this:
maybe_res <- broken(data)
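A small sketch of the point about assignment, using a hypothetical toy data frame and a hypothetical impute() variant of broken(). It treats every non-numeric column as categorical, on the assumption that under R 4.0 or later read.csv() returns character rather than factor columns unless stringsAsFactors = TRUE is set, in which case is.factor() alone would miss them. The mode helper here also ignores NA, which is a change from the original mode_function.

```r
# Mode of the non-missing values (ties resolve to the first value seen)
mode_function <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical variant of broken(): mean for numeric columns, mode otherwise
impute <- function(data) {
  for (i in seq_along(data)) {
    missing <- is.na(data[[i]])
    if (!any(missing)) next
    if (is.numeric(data[[i]])) {
      data[missing, i] <- mean(data[[i]], na.rm = TRUE)
    } else {
      data[missing, i] <- mode_function(data[[i]])
    }
  }
  data
}

# Toy data, not the credit screening set
toy <- data.frame(grp = c("a", "b", NA, "b"),
                  val = c(1, NA, 3, 5),
                  stringsAsFactors = FALSE)

invisible(impute(toy))       # result is computed but thrown away
na_before <- sum(is.na(toy)) # still 2: toy itself was never modified

toy <- impute(toy)           # assign the result back
na_after <- sum(is.na(toy))  # 0
```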
Check this:
> sapply(X, function(x) sum(is.na(x)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
12 12 0 6 6 9 9 0 0 0 0 0 0 13 0 0
> sapply( broken(X), function(x) sum(is.na(x)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I should warn you that mode functions are notorious for returning answers that may not be what you want, for example when several values tie for most frequent or when NA is itself the most common value.
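To make that warning concrete, here is how the mode_function from the question behaves on a few hand-made vectors (the inputs are made up for illustration):

```r
mode_function <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

tie     <- mode_function(c("a", "b", "b", "a"))  # "a": ties go to the first value seen
mostly  <- mode_function(c(NA, NA, "a"))         # NA: missing values are counted too
numbers <- mode_function(c(1.5, 2.5, 2.5))       # 2.5
```

The second case matters for imputation: a column that is mostly missing would be "filled" with NA unless the missing values are dropped before counting.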
I want to transpose a column in several smaller parts based on another column's values e.g.
1 ID1 V1
2 ID1 V2
3 ID1 V3
4 ID2 V4
5 ID2 V5
6 ID3 V6
7 ID3 V7
8 ID3 V8
9 ID3 V9
I wish to have all V values for each ID to be in one row e.g.
ID1 V1 V2 V3
ID2 V4 V5
ID3 V6 V7 V8 V9
Each id has different number of rows to transpose as shown in the example. If it is easier to use the serial number column to perform this then that is fine too.
Can anyone help ?
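Here is a simple awk one-liner to do the trick (awk does not guarantee the iteration order of for (i in a), so pipe the result through sort if the order of the IDs matters):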
awk '1 {if (a[$2]) {a[$2] = a[$2]" "$3} else {a[$2] = $3}} END {for (i in a) { print i,a[i]}}' file.txt
Output:
ID1 V1 V2 V3
ID2 V4 V5
ID3 V6 V7 V8 V9
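For readers who would rather stay in R, a rough equivalent of the grouping step might look like the sketch below. The column names n, id and v are assumptions, since the original data has no header.

```r
# Input copied from the question; column names are hypothetical
d <- read.table(text = "
1 ID1 V1
2 ID1 V2
3 ID1 V3
4 ID2 V4
5 ID2 V5
6 ID3 V6
7 ID3 V7
8 ID3 V8
9 ID3 V9
", col.names = c("n", "id", "v"), stringsAsFactors = FALSE)

# Paste together all v values that share an id
wide <- tapply(d$v, d$id, paste, collapse = " ")

cat(paste(names(wide), wide), sep = "\n")
# ID1 V1 V2 V3
# ID2 V4 V5
# ID3 V6 V7 V8 V9
```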
If you like coding in JavaScript, this is how to do it on the command line using jline: https://github.com/bitdivine/jline/
mmurphy#violet:~$ cat ,,, | jline-foreach 'begin::global.all={}' line::'fields=record.split(/ +/);if(fields.length==3)tm.incrementPath(all,fields.slice(1))' end::'tm.find(all,{maxdepth:1},function(path,val){console.log(path[0],Object.keys(val).join(","));})'
ID1 V1,V2,V3
ID2 V4,V5
ID3 V6,V7,V8,V9
where the input is:
mmurphy#violet:~$ cat ,,,
1 ID1 V1
2 ID1 V2
3 ID1 V3
4 ID2 V4
5 ID2 V5
6 ID3 V6
7 ID3 V7
8 ID3 V8
9 ID3 V9
mmurphy#violet:~$
Explanation: This builds a tree where the first level of branches is the user ID and the second is the V (version?). You could do this for any number of levels. The leaves are just counters. First we create an empty tree:
'begin::global.all={}'
Then each incoming line is split into counter, ID and version number. The counter is sliced off, leaving just the array [userID,version]. incrementPath creates those branches in the tree, a bit like mkdir -p, and increments the leaf counter, although you don't actually need to know how often each user,version combination has been seen:
line::'fields=record.split(/ +/);if(fields.length==3)tm.incrementPath(all,fields.slice(1))' end::'tm.find(all,{maxdepth:1},function(path,val){console.log(path[0],Object.keys(val).join(","));})'
At the end we have tm.find, which behaves just like UNIX find and prints every path in the tree, except that we limit the depth of the search to the desired breakdown (1, but if you're like me you'll be wanting to do a breakdown of 2, 3, 5 or 8 variables next). That way you have separated out the breakdown and your list of values, and you can print your answer.
If you are never going to need deeper breakdowns you will probably want to stick with awk, as it's probably preinstalled.
I have this line in one of my functions: result[result>0.05] <- "". It replaces all values greater than 0.05 in my data frame, including the row names in the first column. How can I avoid this?
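A quick way is to drop the first column from both the selection and the condition with df[-1], so it is never tested or overwritten: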
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
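One side effect visible in the output above is that assigning '' turns the affected columns into character, which is why the surviving values print with so many digits. If the columns should stay numeric, a possible variation is to assign NA and to pick the columns by type instead of by position. This is only a sketch: the data frame res and its column names are made up.

```r
# Hypothetical data: a label column followed by numeric columns
res <- data.frame(name = c("a", "b", "c"),
                  p1 = c(0.01, 0.20, 0.04),
                  p2 = c(0.50, 0.03, 0.90),
                  stringsAsFactors = FALSE)

# Only touch numeric columns, so the label column is never compared or replaced
num_cols <- sapply(res, is.numeric)
res[num_cols][res[num_cols] > 0.05] <- NA

res
```

When printing or exporting, the NA cells can still be shown as blanks, for example with write.csv(res, na = "").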