My data frame contains five character columns and one numeric column. When I export the data frame, all columns are converted to strings, including the numeric column. How do I avoid this?
The structure of my data frame:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3194 obs. of 6 variables:
$ State_FIPS_Code : chr "00" "01" "01" "01" ...
$ County_FIPS_Code : chr "000" "000" "001" "003" ...
$ Postal_Code : chr "US" "AL" "AL" "AL" ...
$ Name : chr "United States" "Alabama" "Autauga County" "Baldwin County" ...
$ Poverty_Percent_All_Ages: num 14.7 18.5 12.7 12.9 32 22.2 14.7 39.6 25.8 20 ...
$ geoid : chr "00000" "01000" "01001" "01003" ...
Export code:
write.csv(dfGeoid, file = "MyData.csv", row.names = TRUE)
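A minimal sketch of one way to check this, assuming the strings show up when the CSV is read back rather than when it is written. write.csv only quotes character columns and leaves numeric values unquoted, so the file itself keeps the distinction. The toy data frame and the temporary file path below are made up for illustration.

```r
# Toy stand-in for dfGeoid (made-up subset of the columns shown above)
dfGeoid <- data.frame(
  State_FIPS_Code = c("00", "01"),
  Name = c("United States", "Alabama"),
  Poverty_Percent_All_Ages = c(14.7, 18.5),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")  # hypothetical output location
write.csv(dfGeoid, file = path, row.names = FALSE)

# Inspect the raw file: character values are quoted, numbers are not
raw_lines <- readLines(path)
print(raw_lines)

# Read back, forcing the FIPS code to stay character so "01" keeps its
# leading zero; columns not named in colClasses are guessed as usual
back <- read.csv(path,
                 colClasses = c(State_FIPS_Code = "character"),
                 stringsAsFactors = FALSE)
str(back)
```

If every column still comes back as character in the real data, looking at the raw lines with readLines() should show whether the numeric column was already stored as text before the export.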
I would like to perform a specific calculation on a large dataset.
This is my MWE using an API call (it takes only 3-4 seconds to download):
devtools::install_github('mingjerli/IMFData')
library(IMFData)
fdi_asst <- c("BFDA_BP6_USD","BFDAD_BP6_USD","BFDAE_BP6_USD")
databaseID <- "BOP"
startdate <- "1980-01-01"
enddate <- "2016-12-31"
checkquery <- FALSE
FDI_ASSETS <- as.data.frame(CompactDataMethod(databaseID, list(CL_FREQ = "Q", CL_AREA_BOP = "", CL_INDICATOR_BOP = fdi_asst), startdate, enddate, checkquery))
My data frame 'FDI_ASSETS' looks like this (I provide a picture instead of head() for convenience)
The last column is a list; each of its elements is a data frame with up to three more variables:
head(FDI_ASSETS$Obs)
[[1]]
#TIME_PERIOD #OBS_VALUE #OBS_STATUS
1 1980-Q1 30.0318922812441 <NA>
2 1980-Q2 23.8926174547104 <NA>
3 1980-Q3 26.599634375058 <NA>
4 1980-Q4 32.7522451203517 <NA>
5 1981-Q1 44.124979234001 <NA>
6 1981-Q2 35.9907120805994 <NA>
MY SCOPE
I want to do the following:
if/when the "#UNIT_MULT == 6" then divide the "#OBS_VALUE" in FDI_ASSETS$Obs by 1000
if/when the "#UNIT_MULT == 3" then divide the "#OBS_VALUE" in FDI_ASSETS$Obs by 1000000
UPDATE
Structure of FDI_ASSETS looks like this:
str(FDI_ASSETS)
'data.frame': 375 obs. of 6 variables:
$ #FREQ : chr "Q" "Q" "Q" "Q" ...
$ #REF_AREA : chr "FI" "MX" "MX" "TO" ...
$ #INDICATOR : chr "BFDAE_BP6_USD" "BFDAD_BP6_USD" "BFDAE_BP6_USD" "BFDAD_BP6_USD" ...
$ #UNIT_MULT : chr "6" "6" "6" "3" ...
$ #TIME_FORMAT: chr "P3M" "P3M" "P3M" "P3M" ...
$ Obs :List of 375
..$ :'data.frame': 147 obs. of 3 variables:
.. ..$ #TIME_PERIOD: chr "1980-Q1" "1980-Q2" "1980-Q3" "1980-Q4" ...
.. ..$ #OBS_VALUE : chr "30.0318922812441" "23.8926174547104" "26.599634375058" "32.7522451203517" ...
.. ..$ #OBS_STATUS : chr NA NA NA NA ...
..$ :'data.frame': 60 obs. of 2 variables:
.. ..$ #TIME_PERIOD: chr "2001-Q1" "2001-Q3" "2002-Q1" "2002-Q2" ...
.. ..$ #OBS_VALUE : chr "9.99999999748979E-05" "9.99999997475243E-05" "9.8999999998739E-05" "-9.90000000342661E-05" ...
..$ :'data.frame': 63 obs. of 2 variables:
.. ..$ #TIME_PERIOD: chr "2001-Q1" "2001-Q2" "2001-Q3" "2001-Q4" ...
.. ..$ #OBS_VALUE : chr "130.0149" "189.627" "3453.8319" "630.483" ...
..$ :'data.frame': 17 obs. of 2 variables:
I downloaded your data and it is quite complicated. I have removed my wrong answer so that you can get it answered by @akrun or someone similar :) I don't have the time to parse through it right now.
I found the following solution
list_assets <- list(FDI_ASSETS = FDI_ASSETS, Portfolio_ASSETS = Portfolio_ASSETS, other_invest_ASSETS = other_invest_ASSETS, fin_der_ASSETS = fin_der_ASSETS, Reserves = Reserves)
# Loop over the names so each modified data frame can be written back;
# `for (df in list_assets)` would only change a temporary copy.
for (nm in names(list_assets)) {
  df <- list_assets[[nm]]
  for (i in seq_along(df$"#UNIT_MULT")) {
    if (df$"#UNIT_MULT"[i] == "6") {
      df$Obs[[i]]$"#OBS_VALUE" <- as.numeric(df$Obs[[i]]$"#OBS_VALUE") / 1000
    } else if (df$"#UNIT_MULT"[i] == "3") {
      df$Obs[[i]]$"#OBS_VALUE" <- as.numeric(df$Obs[[i]]$"#OBS_VALUE") / 1000000
    }
  }
  list_assets[[nm]] <- df
}
Please let me know how I can modify the code in order to make it more efficient and avoid these loops.
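One possible way to drop the explicit index loops is a lookup vector for the divisor plus Map() over the Obs list. This is only a sketch on a made-up stand-in for FDI_ASSETS: the helper name rescale_obs, the divisor vector, and the toy values are assumptions, not part of IMFData.

```r
# Toy stand-in for FDI_ASSETS (made-up values, same column names as above)
toy <- data.frame(`#UNIT_MULT` = c("6", "3", "0"),
                  check.names = FALSE, stringsAsFactors = FALSE)
toy$Obs <- list(
  data.frame(`#OBS_VALUE` = c("1000", "2500"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(`#OBS_VALUE` = c("2000000", "500000"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(`#OBS_VALUE` = c("7", "8"),
             check.names = FALSE, stringsAsFactors = FALSE)
)

# Divisor per multiplier, following the rule stated in the question
divisor <- c("6" = 1000, "3" = 1e6)

# Hypothetical helper: rescale every element of Obs in one pass
rescale_obs <- function(df) {
  div <- divisor[df[["#UNIT_MULT"]]]  # NA where the multiplier is not 6 or 3
  df$Obs <- Map(function(obs, d) {
    if (!is.na(d)) {
      obs[["#OBS_VALUE"]] <- as.numeric(obs[["#OBS_VALUE"]]) / d
    }
    obs
  }, df$Obs, div)
  df
}

out <- rescale_obs(toy)
str(out$Obs)
```

For the full list of data frames this would become list_assets <- lapply(list_assets, rescale_obs). Note that, as in the loop version, elements with any other multiplier are left untouched and so keep their character values.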
After importing data from a JSON stream, I have a list of 621 lists, each holding the same 22 variables.
List of 621
$ :List of 22
..$ _id : chr "55c79e711cbee48856a30886"
..$ number : num 1
..$ country : chr "Yemen"
..$ date : chr "2002-11-03T00:00:00.000Z"
..$ narrative : chr ""
..$ town : chr ""
..$ location : chr ""
..$ deaths : chr "6"
..$ deaths_min : chr "6"
..$ deaths_max : chr "6"
..$ civilians : chr "0"
..$ injuries : chr ""
..$ children : chr ""
..$ tweet_id : chr "278544689483890688"
..$ bureau_id : chr "YEM001"
..$ bij_summary_short: chr ""
..$ bij_link : chr ""
..$ target : chr ""
..$ lat : chr "15.47467"
..$ lon : chr "45.322755"
..$ articles : list()
..$ names : chr ""| __truncated__
$ :List of 22
..$ _id : chr "55c79e711cbee48856a30887"
..$ number : num 2
..$ country : chr "Pakistan"
..$ date : chr "2004-06-17T00:00:00.000Z"
..$ narrative : chr ""
..$ town : chr ""
..$ location : chr ""
..$ deaths : chr "6-8"
..$ deaths_min : chr "6"
..$ deaths_max : chr "8"
..$ civilians : chr "2"
..$ injuries : chr "1"
..$ children : chr "2"
..$ tweet_id : chr "278544750867533824"
..$ bureau_id : chr "B1"
..$ bij_summary_short: chr ""| __truncated__
..$ bij_link : chr ""
..$ target : chr ""
..$ lat : chr "32.30512565"
..$ lon : chr "69.57624435"
..$ articles : list()
..$ names : chr ""
...
How can I combine these lists into one data frame of 621 observations of 22 variables? Notice that all 621 lists are unnamed.
edit: Per request, here is how I got this data set:
library(rjson)
url <- 'http://api.dronestre.am/data'
document <- fromJSON(file=url, method='C')
str(document$strike)
Can you provide an example of how you generated the data? I have not tested this answer, but the following should help. If you update the question to show how you obtained the data, I can try it out.
update
library(rjson)
library(data.table)
library(dplyr)
url <- 'http://api.dronestre.am/data'
document <- fromJSON(file=url, method='C')
is(document)
listdata<- document$strike
df<-do.call(rbind,listdata) %>% as.data.table
dim(df)
purrr has a useful transpose function which 'inverts' a list. The $articles element causes trouble: it appears to always be empty, which scuppers the conversion to a data.frame, so I've dropped it (it is element 21).
library(purrr)
library(dplyr)  # provides tbl_df()
df <- transpose(document$strike) %>%
  t %>%
  apply(FUN = unlist, MARGIN = 2)
# Drop element 21 ($articles) before building the data frame
df <- df[-21] %>% data.frame %>% tbl_df
df
Source: local data frame [621 x 21]
X_id number country date
(fctr) (dbl) (fctr) (fctr)
1 55c79e711cbee48856a30886 1 Yemen 2002-11-03T00:00:00.000Z
2 55c79e711cbee48856a30887 2 Pakistan 2004-06-17T00:00:00.000Z
3 55c79e711cbee48856a30888 3 Pakistan 2005-05-08T00:00:00.000Z
4 55c79e721cbee48856a30889 4 Pakistan 2005-11-05T00:00:00.000Z
5 55c79e721cbee48856a3088a 5 Pakistan 2005-12-01T00:00:00.000Z
6 55c79e721cbee48856a3088b 6 Pakistan 2006-01-06T00:00:00.000Z
7 55c79e721cbee48856a3088c 7 Pakistan 2006-01-13T00:00:00.000Z
8 55c79e721cbee48856a3088d 8 Pakistan 2006-10-30T00:00:00.000Z
9 55c79e721cbee48856a3088e 9 Pakistan 2007-01-16T00:00:00.000Z
10 55c79e721cbee48856a3088f 10 Pakistan 2007-04-27T00:00:00.000Z
.. ... ... ... ...
Variables not shown: narrative (fctr), town (fctr), location (fctr), deaths
(fctr), deaths_min (fctr), deaths_max (fctr), civilians (fctr), injuries
(fctr), children (fctr), tweet_id (fctr), bureau_id (fctr), bij_summary_short
(fctr), bij_link (fctr), target (fctr), lat (fctr), lon (fctr), names (fctr)
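For comparison, a base R sketch of the same idea on two made-up records shaped like the ones in the question. It assumes every field that is kept has length 1 in every record; otherwise the rows would not line up for rbind. The object name strikes and its values are illustrative only.

```r
# Two made-up records with a subset of the fields shown in the question
strikes <- list(
  list(`_id` = "a1", number = 1, country = "Yemen",
       deaths = "6", articles = list()),
  list(`_id` = "a2", number = 2, country = "Pakistan",
       deaths = "6-8", articles = list())
)

rows <- lapply(strikes, function(rec) {
  rec <- rec[lengths(rec) == 1L]  # keep scalar fields; drops empty `articles`
  as.data.frame(rec, stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single data frame
df <- do.call(rbind, rows)
str(df)
```

as.data.frame() repairs the name `_id` to X_id, which matches the column name in the output above. With stringsAsFactors = FALSE the text columns stay character instead of becoming factors.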
My first time here so I hope I don't break anything...
I have a list of lists:
Browse[2]> head(str(mylist))
List of 33
$ : chr [1:33] "0001" "space" "28" "night_club" ...
$ : chr [1:33] "0002" "concert" "28" "night_club" ...
$ : chr [1:31] "0003" "night_club" "24" "martial_arts" ...
$ : chr [1:31] "0004" "stage" "24" "basketball" ...
$ : chr [1:43] "0005" "night_club" "16" "concert" ...
$ : chr [1:43] "0006" "night_club" "16" "concert" ...
$ : chr [1:39] "0007" "night_club" "22" "concert" ...
$ : chr [1:39] "0008" "night_club" "22" "concert" ...
$ : chr [1:31] "0009" "night_club" "46" "martial_arts" ...
$ : chr [1:31] "0010" "night_club" "46" "martial_arts" ...
$ : chr [1:41] "0011" "night_club" "17" "martial_arts" ...
$ : chr [1:41] "0012" "night_club" "17" "martial_arts" ...
$ : chr [1:29] "0013" "concert" "23" "night_club" ...
$ : chr [1:29] "0014" "concert" "23" "night_club" ...
$ : chr [1:25] "0015" "night_club" "26" "concert" ...
$ : chr [1:31] "0016" "night_club" "42" "concert" ...
$ : chr [1:31] "0017" "night_club" "42" "concert" ...
$ : chr [1:31] "0018" "night_club" "25" "wrestling" ...
$ : chr [1:31] "0019" "night_club" "25" "wrestling" ...
$ : chr [1:33] "0020" "night_club" "46" "wrestling" ...
$ : chr [1:33] "0021" "night_club" "46" "wrestling" ...
$ : chr [1:41] "0022" "concert" "21" "stage" ...
$ : chr [1:41] "0023" "concert" "21" "stage" ...
$ : chr [1:55] "0024" "basketball" "8" "concert" ...
$ : chr [1:55] "0025" "basketball" "8" "concert" ...
$ : chr [1:37] "0026" "bald_person" "26" "martial_arts" ...
$ : chr [1:37] "0027" "bald_person" "26" "martial_arts" ...
$ : chr [1:37] "0028" "night_club" "32" "business_meeting" ...
$ : chr [1:37] "0029" "night_club" "32" "business_meeting" ...
$ : chr [1:15] "0030" "night_club" "59" "stage" ...
$ : chr [1:37] "0031" "stage" "12" "night_club" ...
$ : chr [1:37] "0032" "stage" "12" "night_club" ...
$ : chr [1:33] "0033" "night_club" "23" "portrait" ...
I want to turn this list into a wide-format data frame. The first column would hold the first element of each inner vector (i.e. "0001", "0002" etc.), and there would be one column for every category that exists in the file:
"space", "night_club", "concert", "martial_arts", "wrestling" etc.
In other words, I would get a very wide data frame in which each row begins with an id (0001, 0002, 0003 ...) and the column names are the categories in the file. For each row, where a category exists for that id, the column is populated with the value that follows the category in the list (for example, "space" -> 28 in the first line).
I tried to construct a normalized data frame with loops and then convert it to a wide format, but that would scale badly as the data grows:
for (file in files){# iterate over files in folder
  mylist <- strsplit(readLines(file), ":")
  #close(mylist)
  for (elem in mylist){
    dataframe <- data.frame(frameid = numeric(), category = character(), nrow = length(unlist(elem)))
    frameid <- rep.int(elem[[1]], length(elem)-1)
    categories <- elem[-1:-1]
    dataframe$frameid <- frameid
    dataframe$category <- categories
  }
}
Reproducible input output example:
dput of input:
list(c("0001", "space", "28", "night_club", "25"), c("0002",
"concert", "28", "night_club", "26"), c("0003", "night_club",
"24", "martial_arts", "27"), c("0004", "stage", "24", "basketball",
"30"))
output:
Dataframe
frameid, cat_space, cat_night_club, cat_concert, cat_martial_arts, cat_stage, cat_basketball
0001, 28, 25, 0, 0, 0, 0
0002, 0, 26, 28, 0, 0, 0
0003, 0, 24, 0, 27, 0, 0
0004, 0, 0, 0, 0, 24, 30
Here's a possibility. I've created the answer as a function, and commented what is happening at each stage. The basic idea is to:
Create a column of just the first items from each list element.
Create a two-column matrix of the rest of the items. This assumes that the data are nicely paired.
Create a data.frame of these two elements put together.
Use xtabs to convert the output to a wide format. Note that if there are duplicated combinations of "ID" and "var", the values would be added together because of the use of xtabs.
Here's the function:
myFun <- function(inList) {
  ## Extract the first value in each list element
  ID <- vapply(inList, `[`, character(1L), 1)
  ## Convert the remaining elements into a two-column matrix, first
  ## column as variable, second column as value. Bind all list
  ## elements together into a single 2-column matrix.
  varval <- do.call(rbind, lapply(inList, function(x) {
    matrix(x[-1], ncol = 2, byrow = TRUE, dimnames = list(NULL, c("var", "val")))
  }))
  ## Create a data.frame where ID is repeated to the same number of rows
  ## as the matrices found in varval.
  temp <- data.frame(ID = rep(ID, (lengths(inList)-1)/2), varval)
  ## Convert the val column to numeric
  temp$val <- as.numeric(as.character(temp$val))
  ## Use xtabs to go from a "long" form to a "wide" form
  xtabs(val ~ ID + var, temp)
}
Here it is applied to your sample data (assuming your data is called "L"):
myFun(L)
# var
# ID basketball concert martial_arts night_club space stage
# 0001 0 0 0 25 28 0
# 0002 0 28 0 26 0 0
# 0003 0 0 27 24 0 0
# 0004 30 0 0 0 0 24
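The xtabs result is a contingency table rather than a data frame. The following self-contained sketch repeats the same long-to-wide idea on the sample input and then converts the table to a data frame with the frameid and cat_ column names requested in the question. The intermediate names (long, wide, out) are illustrative, and the columns come out in alphabetical order rather than the order listed in the question.

```r
L <- list(c("0001", "space", "28", "night_club", "25"),
          c("0002", "concert", "28", "night_club", "26"),
          c("0003", "night_club", "24", "martial_arts", "27"),
          c("0004", "stage", "24", "basketball", "30"))

ID   <- vapply(L, `[`, character(1L), 1)   # first item of each element
n    <- (lengths(L) - 1) / 2               # number of category/value pairs
rest <- unlist(lapply(L, `[`, -1))         # everything after the id

# Long form: odd positions are categories, even positions are values
long <- data.frame(ID  = rep(ID, n),
                   var = rest[c(TRUE, FALSE)],
                   val = as.numeric(rest[c(FALSE, TRUE)]),
                   stringsAsFactors = FALSE)

# Wide table, then a plain data frame with the requested column names
wide <- as.data.frame.matrix(xtabs(val ~ ID + var, long))
out  <- data.frame(frameid = rownames(wide),
                   setNames(wide, paste0("cat_", names(wide))),
                   stringsAsFactors = FALSE)
rownames(out) <- NULL
out
```

Like the function above, this assumes the items after the id are cleanly paired, and duplicated id/category combinations would be summed by xtabs.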
I have a dataset on which I'm planning to use ubRacing from the unbalanced package, but ubRacing only accepts numeric columns. Is there any way I can convert all the chr columns to numeric in R?
Thanks
'data.frame': 31000 obs. of 22 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : chr "housemaid" "services" "services" "admin." ...
$ marital : chr "married" "married" "married" "married" ...
$ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
$ default : chr "no" "unknown" "no" "no" ...
$ housing : chr "no" "no" "yes" "no" ...
$ loan : chr "no" "no" "no" "no" ...
$ contact : chr "telephone" "telephone" "telephone" "telephone" ...
$ month : chr "may" "may" "may" "may" ...
$ day_of_week : chr "mon" "mon" "mon" "mon" ...
It is not clear how the character columns should be converted to numeric. One possible option would be to convert each character column to a factor and then coerce it to numeric. We loop through the columns of the dataset with lapply.
df1[] <- lapply(df1, function(x) if (is.character(x)) as.numeric(factor(x)) else x)
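A small worked example of the factor approach on made-up data with a few of the columns from the question. The resulting numbers are just the positions of the sorted factor levels, so they carry an arbitrary ordering that a model may treat as meaningful.

```r
# Toy data frame (made-up rows using column names from the question)
df1 <- data.frame(age     = c(56L, 57L, 37L),
                  job     = c("housemaid", "services", "services"),
                  default = c("no", "unknown", "no"),
                  stringsAsFactors = FALSE)

# Replace each character column by the integer codes of its factor levels;
# df1[] keeps the data.frame structure while swapping the columns
df1[] <- lapply(df1, function(x) {
  if (is.character(x)) as.numeric(factor(x)) else x
})

str(df1)
```

Here job becomes 1, 2, 2 because "housemaid" sorts before "services", and default becomes 1, 2, 1.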
I'm using the 'pls' package, and for this I need to produce a data frame with a structure a bit different from what I'm used to.
Data frame structure I need to work with 'pls': gasoline
library(pls)
gasoline
Example showing what my data looks like: gasoline2
Background info: I usually load data into R by transcribing it into an .xls file and converting that file to .txt, which is then loaded into R.
When my data is loaded it looks like this:
gasoline2 <- as.data.frame(as.matrix(gasoline))
Question
How can I convert the structure of gasoline2 into the structure of gasoline?
Thanks a lot in advance for your help!
You're looking for I, which will allow you to combine different data structures (like lists or matrices) as columns in a data.frame:
## Assume you are starting with this:
X <- as.data.frame(as.matrix(gasoline))
## Create a new object where column 1 is the same as the first
## column in your existing data frame, and column 2 is a matrix
## of the remaining columns
newGas <- cbind(X[1], NIR = I(as.matrix(X[-1])))
str(gasoline)
# 'data.frame': 60 obs. of 2 variables:
# $ octane: num 85.3 85.2 88.5 83.4 87.9 ...
# $ NIR : AsIs [1:60, 1:401] -0.050193 -0.044227 -0.046867 -0.046705 -0.050859 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr "1" "2" "3" "4" ...
# .. ..$ : chr "900 nm" "902 nm" "904 nm" "906 nm" ...
str(newGas)
# 'data.frame': 60 obs. of 2 variables:
# $ octane: num 85.3 85.2 88.5 83.4 87.9 ...
# $ NIR : AsIs [1:60, 1:401] -0.050193 -0.044227 -0.046867 -0.046705 -0.050859 ...
# ..- attr(*, "dimnames")=List of 2
# .. ..$ : chr "1" "2" "3" "4" ...
# .. ..$ : chr "NIR.900 nm" "NIR.902 nm" "NIR.904 nm" "NIR.906 nm" ...
There's a slight difference in the column naming, but I think that can easily be taken care of...
> colnames(newGas$NIR) <- gsub("NIR.", "", colnames(newGas$NIR))
> identical(gasoline, newGas)
[1] TRUE
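The same idea can be tried without the pls data. The sketch below uses a small made-up data frame (the values and the names flat and nested are illustrative) and assigns the matrix column with $, which stores the matrix as it is and so leaves its column names without the "NIR." prefix.

```r
# Made-up flat data frame: one response and two spectral columns
flat <- data.frame(octane   = c(85.3, 85.2, 88.5),
                   `900 nm` = c(-0.050, -0.044, -0.047),
                   `902 nm` = c(-0.046, -0.040, -0.041),
                   check.names = FALSE)

# Keep the first column, then store all remaining columns as one
# matrix-valued column protected by I()
nested <- data.frame(octane = flat$octane)
nested$NIR <- I(as.matrix(flat[-1]))

str(nested)
colnames(nested$NIR)
```

The result has two variables, octane and NIR, where NIR is a 3 x 2 matrix of class AsIs, mirroring the structure of gasoline.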