Convert delimited string to numeric vector in dataframe - r

This is such a basic question, I'm embarrassed to ask.
Let's say I have a dataframe full of columns which contain data of the following form:
test <-"3000,9843,9291,2161,3458,2347,22925,55836,2890,2824,2848,2805,2808,2775,2760,2706,2727,2688,2727,2658,2654,2588"
I want to convert this to a numeric vector, which I have done like so:
test <- as.numeric(unlist(strsplit(test, split=",")))
I now want to convert a large dataframe containing a column full of this data into a numeric vector equivalent:
mutate(data,
converted = as.numeric(unlist(strsplit(badColumn, split=","))),
)
This doesn't work because presumably it's converting the entire column into a numeric vector and then replacing a single row with that value:
Error in mutate_impl(.data, dots) : Column converted must be
length 20 (the number of rows) or one, not 1274
How do I do this?

Here's some sample data that reproduces your error:
data <- data.frame(a = 1:3,
badColumn = c("10,20,30,40,50", "1,2,3,4,5,6", "9,8,7,6,5,4,3"),
stringsAsFactors = FALSE)
Here's the error:
library(tidyverse)
mutate(data, converted = as.numeric(unlist(strsplit(badColumn, split=","))))
# Error in mutate_impl(.data, dots) :
# Column `converted` must be length 3 (the number of rows) or one, not 18
A straightforward way is to call strsplit on the entire column, then use lapply with as.numeric to convert each element of the resulting list from a character vector to a numeric vector.
x <- mutate(data, converted = lapply(strsplit(badColumn, ",", TRUE), as.numeric))
str(x)
# 'data.frame': 3 obs. of 3 variables:
# $ a : int 1 2 3
# $ badColumn: chr "10,20,30,40,50" "1,2,3,4,5,6" "9,8,7,6,5,4,3"
# $ converted:List of 3
# ..$ : num 10 20 30 40 50
# ..$ : num 1 2 3 4 5 6
# ..$ : num 9 8 7 6 5 4 3
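To see why the original mutate() call fails, compare the lengths involved. The sketch below is base R only and reuses the sample data from this answer; the variable name flat is just an illustration. unlist() flattens the pieces from every row into one long vector, whereas keeping the list gives one numeric vector per row.

```r
# Sample data copied from the answer above
data <- data.frame(a = 1:3,
                   badColumn = c("10,20,30,40,50", "1,2,3,4,5,6", "9,8,7,6,5,4,3"),
                   stringsAsFactors = FALSE)

# unlist() merges the values from all rows into a single vector,
# so its length no longer matches the number of rows (3)
flat <- as.numeric(unlist(strsplit(data$badColumn, ",", fixed = TRUE)))
length(flat)
# [1] 18

# Keeping the list structure yields one numeric vector per row
data$converted <- lapply(strsplit(data$badColumn, ",", fixed = TRUE), as.numeric)
lengths(data$converted)
# [1] 5 6 7

# Per-row summaries can then be computed with sapply()
sapply(data$converted, sum)
# [1] 150  21  42
```

A list column like this can be passed to sapply() or vapply() whenever a per-row result is needed.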

This might help:
library(dplyr)
library(purrr)
mutate(data, converted = map(badColumn, function(txt) as.numeric(unlist(strsplit(txt, split = ",")))))
What you get is a list column which contains the numeric vectors.

Base R
A=c(as.numeric(strsplit(test,',')[[1]]))
A
[1] 3000 9843 9291 2161 3458 2347 22925 55836 2890 2824 2848 2805 2808 2775 2760 2706 2727 2688 2727 2658 2654 2588
df$NEw2=lapply(df$NEw, function(x) c(as.numeric(strsplit(x,',')[[1]])))
# with dplyr: split every row (taking [[1]] here would reuse the first row's values for all rows)
df %>% mutate(NEw2 = lapply(strsplit(NEw, ','), as.numeric))

Related

How to mutate columns in place but keep same column types in R

Mutate in place is working fine as I set multiple dataframe columns blank if another dataframe column is blank. However, the mutated columns' types are changed. How to do this without changing column types?
Starting with data1:
I get data2:
Any ideas how to do this without changing any column types? Perhaps save all column types before the mutate and then set them back after the mutate?
Here's my code to create data1 and mutate to data2:
library(dplyr)
options(stringsAsFactors = FALSE)
col_1_ferment <- c(452,768,856,192,905,752) #numeric type
col_1_crutch <- c('15','34','56','49','28','37') #character type
col_1_grease <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE) #boolean type
col_1_pump <- as.factor(c("3","6","3","2","1","2")) #factor type
indicator_col <- c(2,NA,2,1,1,2) #numeric type
data1 <- data.frame(col_1_ferment, col_1_crutch, col_1_grease, col_1_pump, indicator_col, check.rows = TRUE)
data2 <- data1 %>% mutate(dplyr::across(starts_with("col_1_"), ~ ifelse(is.na(indicator_col), "", .x)))
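You can use NA instead of "", which keeps the numeric, character, and logical columns as they are. Note that ifelse() strips attributes, so the factor column col_1_pump still comes back as its integer codes.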
data2 <- data1 %>% mutate(dplyr::across(starts_with("col_1_"), ~ ifelse(is.na(indicator_col), NA, .x)))
str(data2)
'data.frame': 6 obs. of 5 variables:
$ col_1_ferment: num 452 NA 856 192 905 752
$ col_1_crutch : chr "15" NA "56" "49" ...
$ col_1_grease : logi TRUE NA FALSE FALSE TRUE FALSE
$ col_1_pump : int 3 NA 3 2 1 2
$ indicator_col: num 2 NA 2 1 1 2
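If the factor column also needs to keep its type, one option is to skip ifelse() and assign NA directly into the affected rows. The following is a minimal base R sketch rather than the across() approach used above; target_cols is a name introduced here for illustration, and the data is rebuilt from the question.

```r
# Rebuild data1 from the question
data1 <- data.frame(
  col_1_ferment = c(452, 768, 856, 192, 905, 752),          # numeric
  col_1_crutch  = c("15", "34", "56", "49", "28", "37"),    # character
  col_1_grease  = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), # logical
  col_1_pump    = as.factor(c("3", "6", "3", "2", "1", "2")), # factor
  indicator_col = c(2, NA, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Logical index of the columns to blank out (hypothetical helper name)
target_cols <- startsWith(names(data1), "col_1_")

# Assign NA into the rows where the indicator is missing;
# subassignment keeps each column's existing type
data2 <- data1
data2[is.na(data2$indicator_col), target_cols] <- NA

sapply(data2, class)
```

Because the assignment only replaces individual values, each column keeps its class, including the factor levels of col_1_pump.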

Saving dataframes stored in a list to individual files in R

I have a large list, merged_fin, containing 39 data.frames. The datasets look like this:
> merged_fin[[1]]
sourceid dstid speed
1 177 1 0.010604494
2 46 4 0.010794178
3 100 7 0.007286781
> merged_fin[[2]]
sourceid dstid speed
1 721 12 0.013830787
2 23 15 0.016334978
3 274 16 0.015247266
...
I would like to save each dataset in that list to its own .rds file in my working directory.
Trying:
for (i in 1:length(merged_fin)){
saveRDS(merged_fin[[i]])}
Or
saveRDS(merged_fin[[1]])
I get Error in saveRDS(merged_fin[[i]]) : 'file' must be non-empty string.
Trying:
lapply(names(merged_fin), function(i)
saveRDS(merged_fin[[i]], paste0(i, '.rds')))
I get list() but no file is saved to my working directory.
Notes:
(1) names(merged_fin) outputs NULL; (2) I initially created merged_fin as an empty list (merged_fin <- list()) before filling it with merged datasets I read in from different folders.
Does the problem lie in the way I'm referring to elements of the list?
Is it due to the way merged_fin was initially defined?
Thanks for your help.
Solution
In my case, it was simply a question of naming the elements of my list, which meenaparam suggested. I had a vector containing correctly ordered city names, which was called cities. I just did names(merged_fin) <- cities, and that was enough to successfully run
lapply(names(merged_fin), function(i)
saveRDS(merged_fin[[i]], paste0(i, '.rds')))
Following on from the previous answer, here's an example of how to assign and then grab names of the dataframes in your merged_fin list. Note that if you don't already have individual names for your dataframes, you can also simply assign them using names(merged_fin) <- c("name1", "name2") etc.
df1 <- read.table(h=T, text="
sourceid dstid speed
1 177 1 0.010604494
2 46 4 0.010794178
3 100 7 0.007286781")
df2 <- read.table(h=T, text="
sourceid dstid speed
1 721 12 0.013830787
2 23 15 0.016334978
3 274 16 0.015247266")
# make a list of dataframes
merged_fin <- list(df1, df2)
# see that the names of merged_fin are currently set to NULL
names(merged_fin)
#> NULL
# get the names of all the list-type objects in the workspace that contain the string "df" - we do this because dataframes are stored as lists
names_of_dataframes <- ls.str(mode = "list", pattern = "df")
names_of_dataframes
#> df1 : 'data.frame': 3 obs. of 3 variables:
#> $ sourceid: int 177 46 100
#> $ dstid : int 1 4 7
#> $ speed : num 0.0106 0.01079 0.00729
#> df2 : 'data.frame': 3 obs. of 3 variables:
#> $ sourceid: int 721 23 274
#> $ dstid : int 12 15 16
#> $ speed : num 0.0138 0.0163 0.0152
# assign the dataframe names back to our list of dataframes
names(merged_fin) <- names_of_dataframes
names(merged_fin)
#> [1] "df1" "df2"
# now we can write out the dataframes to files as each dataframe has a name
lapply(names(merged_fin), function(i)
saveRDS(merged_fin[[i]], paste0("~/Desktop/", i, '.rds')))
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
Created on 2020-01-21 by the reprex package (v0.3.0)
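The lapply(names(merged_fin), ...) call in the question returned list() because names(merged_fin) was NULL, so there was nothing to iterate over. If naming the list is not convenient, iterating over positions also works. This is a minimal sketch with made-up data; the output directory (tempdir()) and the merged_fin_ file prefix are assumptions chosen for illustration.

```r
# Small unnamed list standing in for merged_fin (illustrative data)
merged_fin <- list(
  data.frame(sourceid = c(177, 46), dstid = c(1, 4)),
  data.frame(sourceid = c(721, 23), dstid = c(12, 15))
)

# Write to a temporary directory here; swap in getwd() or another path as needed
out_dir <- tempdir()

# seq_along() iterates over positions, so no names are required
for (i in seq_along(merged_fin)) {
  saveRDS(merged_fin[[i]],
          file = file.path(out_dir, paste0("merged_fin_", i, ".rds")))
}

# Check which files were written
list.files(out_dir, pattern = "^merged_fin_.*\\.rds$")
```

Each file can be read back with readRDS() using the same path.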

Convert comma separated decimals from character to numeric

For my exam I have to build some scatter plots in R. I created a data frame with 4 variables, and I want to add regression lines to my scatter plots.
The name of my data frame is "alle".
The variable names are: demo, tot, besch, usd.
With this code I tried to fit the regression line but got the following result:
reg1<- lm(tot~demo, data=alle)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
Here is the structure of "alle":
str(alle)
'data.frame': 11 obs. of 4 variables:
$ demo : chr "498.300.775" "500.297.033" "502.090.235" "503.170.618" ...
$ tot : Factor w/ 11 levels "4.846.423","4.871.049",..: 1 3 4 5 2 8 7 6 10 9 ...
$ besch: Factor w/ 9 levels "68,4","68,6",..: 5 7 3 2 2 1 1 4 6 8 ...
$ usd : Factor w/ 44 levels "0,68434","0,72584",..: 26 30 29 23 28 22 24 25 15 14 ...
I tried to convert the column "demo" to numeric with
alle$demo <- as.numeric(as.character(alle$demo))
It converted the column to numeric, but now the rows are full of NAs.
I think that all columns must be numeric.
How can I convert all 4 columns to numeric and finally plot the regression lines?
Data:
> head(alle,6)
demo tot besch usd
1 498.300.775 4.846.423 69,8 1,3705
2 500.297.033 4.891.934 70,3 1,4708
3 502.090.235 4.901.358 69,0 1,3948
4 503.170.618 4.906.313 68,6 1,3257
5 502.964.837 4.871.049 68,6 1,3920
6 504.047.964 5.010.371 68,4 1,2848
thanks
Try doing it in two steps. First get rid of the dots, then replace the commas by decimal points and coerce to numeric.
alle[] <- lapply(alle, function(x) gsub("\\.", "", x))
alle[] <- lapply(alle, function(x) as.numeric(sub(",", ".", x)))
Note:
The solution above is split into two steps for readability. The following does the same in a single lapply loop and should therefore be faster if the dataset is big. For a small to medium dataset, the two-step solution may be preferable.
alle[] <- lapply(alle, function(x){
as.numeric(sub(",", ".", gsub("\\.", "", x)))
})
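The NAs in the question arise because a string such as "498.300.775" is not a valid number for as.numeric() until the thousands separators are removed. Here is a self-contained sketch of the conversion on a few rows from the question; the helper name to_number is hypothetical, and the sample is built with factors to mimic the original str() output.

```r
# A few rows from the question, stored as factors like in the original data
alle <- data.frame(
  demo  = c("498.300.775", "500.297.033", "502.090.235"),
  tot   = c("4.846.423", "4.891.934", "4.901.358"),
  besch = c("69,8", "70,3", "69,0"),
  usd   = c("1,3705", "1,4708", "1,3948"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: drop "." thousands separators, then turn the decimal "," into "."
to_number <- function(x) {
  x <- gsub(".", "", as.character(x), fixed = TRUE)
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

# alle[] keeps the data frame structure while replacing every column
alle[] <- lapply(alle, to_number)
str(alle)

# With numeric columns the regression from the question can be fitted
reg1 <- lm(tot ~ demo, data = alle)
```

The explicit as.character() makes the factor-to-text step visible; calling as.numeric() directly on a factor would return the level codes instead of the values.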
With dplyr (this converts only besch and usd; demo and tot stay character, as the output shows):
library(dplyr)
alle %>%
mutate_all(as.character) %>%
mutate_at(c("besch","usd"),function(x) as.numeric(as.character(gsub(",",".",x)))) ->alle
demo tot besch usd
1 498.300.775 4.846.423 69.8 1.3705
2 500.297.033 4.891.934 70.3 1.4708
3 502.090.235 4.901.358 69.0 1.3948
4 503.170.618 4.906.313 68.6 1.3257
5 502.964.837 4.871.049 68.6 1.3920
6 504.047.964 5.010.371 68.4 1.2848

Choosing multiple columns and changing their classes using a lookup table in R?

Is it possible to use a lookup table to assign/change the classes of variables in a data frame in R? I have thousands of columns with messed up classes in one data frame (my_df), and list of what they should be in another data frame (my_lt). PSEUDO CODE I was thinking something like use my_lt$variable_name and grep() on colnames(my_df) and pass the output through as.numeric if lt$variable_class == "numeric", with some form of if..else. Any help would be much appreciated!
input - my data frame (my_df)
my_df = data.frame(q1_hight_1=c(12,31,22,12),q1_hight_2=c(24,54,23,32),q1_hight_3=c(34,23,65,34),q2_shoe_size_1=c(2,2,3,4),q2_shoe_size_2=c(4,3,3,4))
input - my lookup table (my_lt)
my_lt = data.frame(variable_name=c("hight","shoe_size"),variable_class=c("numeric","integer"))
desired output (when checking classes)
$q1_hight_1
[1] "numeric"
$q1_hight_2
[1] "numeric"
$q1_hight_3
[1] "numeric"
$q2_shoe_size_1
[1] "integer"
$q2_shoe_size_2
[1] "integer"
This does the trick, given that there's no trap in the names you give to your variables (I use a very naïve grep).
library(dplyr)
library(purrr)
map2(as.character(my_lt$variable_name),
as.character(my_lt$variable_class),
function(nam,cl){ map(grep(nam,names(my_df)),function(i){class(my_df[[i]]) <<- cl})})
str(my_df)
# 'data.frame': 4 obs. of 5 variables:
# $ q1_hight_1 : num 12 31 22 12
# $ q1_hight_2 : num 24 54 23 32
# $ q1_hight_3 : num 34 23 65 34
# $ q2_shoe_size_1: int 2 2 3 4
# $ q2_shoe_size_2: int 4 3 3 4
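The same lookup can be done without <<- by looping over the rows of the lookup table. This is a base R sketch under the same assumption as above, namely that each variable_name matches only the intended column names; it builds the converter from the class name with match.fun(), so it only works for classes that have an as.<class> function.

```r
# Data frame and lookup table, shortened from the question
my_df <- data.frame(
  q1_hight_1     = c(12, 31, 22, 12),
  q1_hight_2     = c(24, 54, 23, 32),
  q2_shoe_size_1 = c(2, 2, 3, 4)
)
my_lt <- data.frame(
  variable_name  = c("hight", "shoe_size"),
  variable_class = c("numeric", "integer"),
  stringsAsFactors = FALSE
)

for (k in seq_len(nrow(my_lt))) {
  # Columns whose names contain the lookup key (plain substring match)
  cols <- grep(my_lt$variable_name[k], names(my_df), fixed = TRUE)
  # Look up the converter, e.g. as.numeric or as.integer
  convert <- match.fun(paste0("as.", my_lt$variable_class[k]))
  my_df[cols] <- lapply(my_df[cols], convert)
}

sapply(my_df, class)
```

Assigning back with my_df[cols] <- lapply(...) changes the columns in place in the calling scope, which avoids modifying a global variable from inside a function.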

R - How to extract an element from a single column data frame?

I have a data frame and need to access the 1st row in the 1st column (Negative=16)
[[1]]
data
Negative 16
Neutral 36
Positive 28
Very Negative 7
Very Positive 19
List of 1
$ :'data.frame': 5 obs. of 1 variable:
..$ data: int [1:5] 16 36 28 7 19
I have tried the following:
x(1,1)
# Error in x(1, 1) : could not find function "x"
x[1,1]
# Error in x[1, 1] : incorrect number of dimensions
x['Negative',1]
# Error in x["Negative", 1] : incorrect number of dimensions
x['Negative']
# $<NA>
# NULL
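x is a list containing one data frame, so pull the data frame out with [[1]] before indexing it:
x[[1]][1, 1]
To keep the whole first row as a data frame instead, use drop = FALSE:
x[[1]][1, , drop = FALSE]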
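The str() output in the question shows that x is a list of one data frame, which is why x[1, 1] reports an incorrect number of dimensions. A minimal sketch that rebuilds that structure from the values shown in the question and then extracts the element:

```r
# Rebuild the structure shown in the question: a list holding one data frame
x <- list(data.frame(
  data = c(16L, 36L, 28L, 7L, 19L),
  row.names = c("Negative", "Neutral", "Positive", "Very Negative", "Very Positive")
))

# Step into the list first, then index the data frame
df <- x[[1]]
df[1, 1]
# [1] 16

# Indexing by row name and column name gives the same value
df["Negative", "data"]
# [1] 16
```

Single brackets on a list (x[1]) return another list, while double brackets (x[[1]]) return the element itself, here the data frame.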
