Read complicated dataset in R

My dataset looks like the sample below. In each pair, the number before the colon is the feature number and the number after it is the value of that feature. I am not sure how to import this dataset into R. Does anyone have any ideas?
236:24 500:163 732:234 869:117 885:106 1249:103 1280:158 1889:119 2015:55 2718:126 3307:137 3578:25 3770:26 4139:128 4723:114 4957:82 5128:50 5420:124 5603:135 5897:34 5946:117 6069:154 6153:55 6347:87 6372:77 6666:109 6866:223 6984:39 7709:253 7950:87 8078:38 8945:141 9316:111 9948:103 9989:68 10276:43 10530:76 10532:55 10799:15 10802:20 10848:82 11347:16 11871:51 11883:105 12534:133 12601:13 12781:178 12798:116 12842:106 12916:7 12935:51 12968:154 13028:58 13330:105 13384:2 13568:47 13641:632 13829:18 13964:62 14385:93 14392:272 15280:140 15424:119 15492:52 15523:31 16311:23 16464:69 16478:94 16584:102 16586:107 16705:272 17138:108 17181:150 17526:280 17540:163 18007:114 18050:53 18180:2 18806:160 18943:73 19055:41 19255:88 19774:59 19889:72 19921:45
101:68 572:57 732:63 962:120 1304:61 1831:60 1889:58 1973:105 2518:161 2629:228 2990:158 3147:75 3578:11 3860:88 4011:18 4623:141 4684:411 4758:69 4820:120 6149:102 6234:134 6306:118 6866:147 6927:89 6988:51 7048:178 7193:31 7257:61 7709:229 8061:125 8202:188 8272:17 8759:165 9104:77 9325:135 9860:97 10055:684 10532:180 10735:64 10744:267 10820:120 10848:186 10923:128 10936:129 11203:160 11303:144 11668:87 11867:97 11871:207 12191:83 12238:193 12380:51 12968:164 13369:58 13929:39 14531:102 14800:130 14931:99 15314:91 15632:62 16165:7 16353:120 16584:137 17216:172 18372:31 18893:75 19133:93 19154:101 19165:133 19607:20 19784:141 19889:97 19921:60

Assuming your data is stored in input.txt:
input <- scan('input.txt', what = 'character')
# byrow = TRUE keeps each feature next to its value; without it the matrix
# is filled column by column and the pairs get scrambled.
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 ...
Alternatively, you can use read.table to parse the input rather than splitting the strings manually. This is slightly slower but more readable.
data <- read.table(text = input, sep = ':')
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: int 236 500 732 869 885 1249 1280 1889 2015 2718 ...
# $ Value : int 24 163 234 117 106 103 158 119 55 126 ...
Edit: adapted for your dataset. This reads your Feature/Value pairs into a data frame.
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/dexter_test.data'
input <- scan(url, what = 'character')
# byrow = TRUE keeps each feature paired with its own value.
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature','Value')
str(data)
# 'data.frame': 192449 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 ...
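The snippets above put every pair into one long table, so the information about which input line (sample) a pair came from is lost. If that matters, the line number can be carried along. The following is only a minimal sketch, assuming each line of the file is one sample; `lines` is inline stand-in data for `readLines('input.txt')`, and the `Sample` column is a name introduced here for illustration.

```r
# Stand-in for readLines('input.txt'): two short rows in the same format.
lines <- c("236:24 500:163 732:234",
           "101:68 572:57")

# Split each line into its 'feature:value' tokens.
tokens <- strsplit(trimws(lines), " +")

# Remember which line (sample) every token came from.
sample_id <- rep(seq_along(tokens), lengths(tokens))

# Split the tokens on ':' and fill the matrix row by row,
# so that column 1 holds features and column 2 holds values.
pairs <- matrix(as.numeric(unlist(strsplit(unlist(tokens), ":"))),
                ncol = 2, byrow = TRUE)

data <- data.frame(Sample  = sample_id,
                   Feature = pairs[, 1],
                   Value   = pairs[, 2])
print(data)
```

The resulting (Sample, Feature, Value) triplets are also the shape that sparse-matrix constructors such as Matrix::sparseMatrix(i, j, x) expect, should a samples-by-features matrix be needed.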

Related

Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first ones being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 .... up to DRM111. For the first column, it would have values ranging from 1 to 3. In date1 it would have a fixed value, which would be "2022-01-05", in date2, it would have values between 2021-12-20 to the maximum that it gives. Category can be ABC or ERF, in DR1 would be values that would vary from 200 to 250, and finally, in DRM columns, would be values that would vary from 0 to 300. Is it possible to create a dataframe like this?
I'm wondering if this is an effort at simulation. The first few tasks seem blindingly obvious, but the last call to replicate with simplify=FALSE might have been a bit less than trivial.
test <- data.frame(ID = rep(1:3, length.out = 90),
                   date1 = as.Date("2022-01-05"),
                   date2 = seq(as.Date("2021-12-20"), length.out = 90, by = 1),
                   # Category = ???? so far not specified
                   # replace = TRUE is needed: 90 draws exceed the 51 values in 200:250
                   DR1 = sample(200:250, 90, replace = TRUE),
                   setNames(replicate(111, sample(0:300, 90), simplify = FALSE),
                            nm = paste("DRM", 1:111)))
Snipped the last 105 rows of the output from str:
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
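For completeness, here is a hedged variant of the same construction that avoids the periods and fills in Category. It is a sketch only: the question does not say how Category should be distributed, so random sampling from "ABC" and "ERF" is an assumption, as are the names `n`, `drm_names`, `drm_cols` and `test2`. sprintf with "%02d" is used to get the zero-padded DRM01, DRM02, ... names the question lists.

```r
set.seed(1)  # only so the sketch is reproducible
n <- 90

# Zero-padded names DRM01 ... DRM111, as listed in the question.
drm_names <- sprintf("DRM%02d", 1:111)

# A named list of 111 vectors; data.frame() turns each into a column.
drm_cols <- setNames(
  replicate(111, sample(0:300, n), simplify = FALSE),
  drm_names
)

test2 <- data.frame(
  ID       = rep(1:3, length.out = n),
  date1    = as.Date("2022-01-05"),
  date2    = seq(as.Date("2021-12-20"), length.out = n, by = 1),
  Category = sample(c("ABC", "ERF"), n, replace = TRUE),  # assumed: random
  DR1      = sample(200:250, n, replace = TRUE),
  drm_cols
)

dim(test2)
head(names(test2), 8)
```

Because the names contain no spaces, check.names has nothing to repair and the columns keep the names given.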

I was trying to mutate a new numeric column in a dataframe, but R is treating it as character and I am not even able to access it by index

library(dslabs)
data(heights)
library(dplyr)
mutate(heights, ht_cm = height * 2.54, stringsAsFactor = FALSE )
str(heights) # not showing ht_cm as a variable in the data frame
mean(heights$ht_cm) # giving error that argument is not numeric
You called mutate but never assigned its result, so heights itself was not changed. If you want to add the new column to heights, you need to assign the result back:
Code
heights <-
heights %>%
mutate(ht_cm = height * 2.54)
Output
str(heights)
'data.frame': 1050 obs. of 3 variables:
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
$ height: num 75 70 68 74 61 65 66 62 66 67 ...
$ ht_cm : num 190 178 173 188 155 ...
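The same rule applies outside dplyr: functions that build a modified data frame return it and leave their input untouched. A minimal base-R sketch of that behaviour, using transform in place of mutate and a small made-up `df` instead of the heights data:

```r
# Made-up stand-in for the heights data (inches).
df <- data.frame(height = c(75, 70, 68))

# The result is computed and printed, but df itself is not changed.
transform(df, ht_cm = height * 2.54)
"ht_cm" %in% names(df)   # FALSE

# Assigning the result back is what makes the column stick.
df <- transform(df, ht_cm = height * 2.54)
"ht_cm" %in% names(df)   # TRUE
mean(df$ht_cm)
```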

R: Iterate through a for loop to print multiple tables

In the house price prediction dataset, there are about 80 variables and 1459 obs.
To understand the data better, I have segregated the variables which are 'char' type.
char_variables = sapply(property_train, is.character)
char_names = names(property_train[,char_variables])
char_names
There are 42 variables that are char datatype.
I want to find the number of observations in each variable.
The simple code for that would be:
table(property_train$Zoning_Class)
Commer FVR RHD RLD RMD
10 65 16 1150 218
But repeating the same for 42 variables would be a tedious task.
The for loop I tried, to print all the tables, gives an error.
for (val in char_names){
print(table(property_train[[val]]))
}
Abnorml AdjLand Alloca Family Normal Partial
101 4 12 20 1197 125
Is there a way to iterate over char_names and print all 42 tables for the dataframe?
str(property_train)
'data.frame': 1459 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Building_Class : int 60 20 60 70 60 50 20 60 50 190 ...
$ Zoning_Class : chr "RLD" "RLD" "RLD" "RLD" ...
$ Lot_Extent : int 65 80 68 60 84 85 75 NA 51 50 ...
$ Lot_Size : int 8450 9600 11250 9550 14260 14115 10084 10382..
$ Road_Type : chr "Paved" "Paved" "Paved" "Paved" ...
$ Lane_Type : chr NA NA NA NA ...
$ Property_Shape : chr "Reg" "Reg" "IR1" "IR1" ...
$ Land_Outline : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
Actually, for me your code does not give an error (make sure to evaluate all lines of the for loop together):
property_train <- data.frame(a = 1:10,
                             b = rep(c("A", "B"), 5),
                             c = LETTERS[1:10],
                             stringsAsFactors = FALSE) # keeps b and c as character in R < 4.0
char_variables = sapply(property_train, is.character)
char_names = names(property_train[, char_variables])
char_names
table(property_train$b)
for (val in char_names) {
  print(table(property_train[[val]]))
}
You can also get this result in a more user-friendly form using dplyr and tidyr, by pivoting all the character columns into a long format and counting all the column-value combinations:
library(dplyr)
library(tidyr)
property_train %>%
  select(where(is.character)) %>%
  pivot_longer(cols = everything(), names_to = "column") %>%
  group_by(column, value) %>%
  summarise(freq = n())
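If printing is not the goal and the tables are needed later, a base-R alternative is to collect them in a named list. A minimal sketch, assuming a toy `property_train` with made-up columns; `char_cols` and `tables` are names introduced here:

```r
# Toy stand-in for property_train (column names are made up).
property_train <- data.frame(a = 1:10,
                             b = rep(c("A", "B"), 5),
                             c = LETTERS[1:10],
                             stringsAsFactors = FALSE)

# Keep only the character columns, then tabulate each one.
char_cols <- Filter(is.character, property_train)
tables <- lapply(char_cols, table)

# Each table is stored under its column name.
names(tables)
tables$b
```

Printing `tables` shows every table under its column name, which makes the output easier to read than an unlabeled series of print calls.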

How can I fully extract all elements into a data frame?

I retrieve some data from an API and convert it to a flat structure.
library(httr)
library(jsonlite) # provides fromJSON
url <- "https://api.carbonintensity.org.uk/intensity/2019-11-25/2019-11-26"
raw_original <- GET(url)
raw <- rawToChar(raw_original$content)
raw <- fromJSON(raw)
api_extr <- do.call("rbind", lapply(raw, data.frame))
At first, all seems well (a 5-column data frame):
> head(api_extr)
from to intensity.forecast intensity.actual intensity.index
1 2019-11-24T23:30Z 2019-11-25T00:00Z 210 200 moderate
2 2019-11-25T00:00Z 2019-11-25T00:30Z 199 200 moderate
3 2019-11-25T00:30Z 2019-11-25T01:00Z 200 198 moderate
4 2019-11-25T01:00Z 2019-11-25T01:30Z 204 189 moderate
5 2019-11-25T01:30Z 2019-11-25T02:00Z 199 191 moderate
6 2019-11-25T02:00Z 2019-11-25T02:30Z 192 193 moderate
However, one of the columns (intensity) is in fact a data frame which contains three further columns.
> str(api_extr)
'data.frame': 49 obs. of 3 variables:
$ from : chr "2019-11-24T23:30Z" "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" ...
$ to : chr "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" "2019-11-25T01:30Z" ...
$ intensity:'data.frame': 49 obs. of 3 variables:
..$ forecast: int 210 199 200 204 199 192 191 194 197 192 ...
..$ actual : int 200 200 198 189 191 193 197 193 193 194 ...
..$ index : chr "moderate" "moderate" "moderate" "moderate" ...
I would expect the data frame to have five columns whereas instead it only has three.
At first glance this may seem insignificant, but the problems start when it comes to working with the data (e.g. plotting it).
How can I achieve five columns?
You can pass the URL directly to fromJSON and flatten the result in a single step.
library(jsonlite)
url <- "https://api.carbonintensity.org.uk/intensity/2019-11-25/2019-11-26"
df <- fromJSON(url, flatten = TRUE)[[1]]
str(df)
'data.frame': 49 obs. of 5 variables:
$ from : chr "2019-11-24T23:30Z" "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" ...
$ to : chr "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" "2019-11-25T01:30Z" ...
$ intensity.forecast: int 210 199 200 204 199 192 191 194 197 192 ...
$ intensity.actual : int 200 200 198 189 191 193 197 193 193 194 ...
$ intensity.index : chr "moderate" "moderate" "moderate" "moderate" ...
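If the nested data frame already exists (for example because the JSON was parsed elsewhere), it can also be flattened after the fact in base R. This is a minimal sketch on a small hand-built data frame that mimics the structure shown in the question; the values are copied from the first two rows above and `flat` is a name introduced here.

```r
# Toy data frame with a nested data-frame column, mimicking the API result.
api_extr <- data.frame(from = c("2019-11-24T23:30Z", "2019-11-25T00:00Z"),
                       stringsAsFactors = FALSE)
api_extr$intensity <- data.frame(forecast = c(210L, 199L),
                                 actual   = c(200L, 200L),
                                 index    = c("moderate", "moderate"),
                                 stringsAsFactors = FALSE)

ncol(api_extr)   # 'intensity' counts as a single column

# Passing the columns to data.frame() splices the nested one in,
# prefixing its column names with 'intensity.'.
flat <- do.call(data.frame, c(api_extr, stringsAsFactors = FALSE))
names(flat)
```

jsonlite also exports a flatten() function for data frames that serves the same purpose when that package is already loaded.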

TM - Clustering data with special date variable

I've got the following data from TripAdvisor:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 674 levels "id","rn106322397",..: 672 671 670 669 668 667 666 665 664 663 ...
$ quote : Factor w/ 606 levels "\"Picturesque Lake Konigssee\"",..: 389 139 113 149 384 39 176 598 199 603 ...
$ rating : Factor w/ 6 levels "1","2","3","4",..: 3 5 5 5 4 5 5 5 4 5 ...
$ date : Factor w/ 505 levels "date","Reviewed 1 August 2014\n",..: 200 200 427 427 427 443 434 351 313 494 ...
$ reviewnospace: Factor w/ 674 levels "- Good car parking facilities- Organized boat trips- Ensure that you have enough time at hand for the boat trip",..: 624 573 144 211 507 26 351 672 451 249 ...
I am trying to cluster the data by date to get two groups: winter and summer vacationers. I then want to analyse the reviews within each group. I am using the tm package and tried the following code:
> x <- read.csv ("seeganz.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
> meta(corp,tag = "date") <- x$date
> idx <- meta(corp, "date") == 'December'
But it is not working, as the content says 0 documents:
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 0
As the date has the structure "Reviewed 1 August 2014", how do I have to adapt this code to get, for example just the reviews from Nov - Feb?
Do you have any idea how I can solve this problem?
Thank you.
Generic Approach:
Use substr(date, 10, nchar(date)) to extract the "1 August 2014" part; call this new vector dateNew.
Use a standard date function, e.g. as.Date(dateNew,...), to convert dateNew into a vector of class Date, on which you can do subsetting, subtraction and other operations.
References from http://www.statmethods.net/input/dates.html
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
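Applied to the question, the comparison `== 'December'` matches nothing because every entry is a full string such as "Reviewed 1 August 2014\n". The sketch below extracts the month instead. It is only an illustration under assumptions: `date_raw` holds made-up examples in the format shown in the question, and `date_txt`, `month_num` and `winter` are names introduced here. The month is looked up in R's built-in month.name vector rather than parsed with as.Date, so the result does not depend on the session locale.

```r
# Made-up examples in the format shown in the question.
date_raw <- c("Reviewed 1 August 2014\n",
              "Reviewed 23 December 2013\n",
              "Reviewed 5 February 2014\n")

# Drop the 'Reviewed ' prefix and the trailing newline.
date_txt <- sub("^Reviewed ", "", trimws(date_raw))

# The month name is the second word; month.name is the built-in
# vector of English month names, so match() returns 1 to 12.
month_num <- match(sapply(strsplit(date_txt, " "), `[`, 2), month.name)

# TRUE for reviews written in November through February.
winter <- month_num %in% c(11, 12, 1, 2)
winter
```

With the real data, `winter` computed from x$date could be used as the index in place of idx, e.g. corp[winter], provided the corpus was built from the same rows in the same order.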
