R - Performance between subset and creating DF from vectors

I was wondering if there is a significant performance difference/impact for large datasets when subsetting the data.
In my scenario, I have a data frame with just under 29,000 records.
When I had to subset the data, I thought of two ways to do it.
The data is read from a CSV file inside a reactive expression.
Option 1:
long_lat_df <- reactive({
  long_lat <- subset(readFile(), select = c(Latitude..deg., Longitude..deg.))
  return(long_lat)
})
Option 2:
What I had in mind was to extract the two columns and assign each to its own variable, long and lat. From there I could combine the two vectors into a new data frame to use for spatial analysis, roughly as sketched below.
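A rough sketch of what option 2 might look like (the column names are taken from option 1; the reactive wrapper and the use of data.frame() are assumptions for illustration):
long_lat_df <- reactive({
  # pull each column out into its own vector
  long <- readFile()$Longitude..deg.
  lat  <- readFile()$Latitude..deg.
  # recombine the vectors into a new data frame for the spatial work
  data.frame(Longitude..deg. = long, Latitude..deg. = lat)
})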
Would there be a potential performance impact between the 2 options?

Related

R read csv with observations in columns

In every example I have seen so far for reading CSV files in R, the variables are in columns and the observations (individuals) are in rows. In an introductory statistics course I am taking there is an example table where the (many) variables are in rows and the (few) observations are in columns. Is there a way to read such a table so that you get a data frame in the usual "orientation"?
Here is a solution that uses the tidyverse. First we gather() the data into long (tidy) format, then spread() it back to wide format; excluding the first column from gather() keeps Variable as the key for the gathered observations.
We'll demonstrate the technique with state level summary data from the U.S. Census Bureau.
We created a table of population data for four states, where the states (observations) are in columns and the variables are listed in rows.
To make the example reproducible, we entered the data into Excel, saved it as a comma separated values file, and pasted its contents into a character string in R, which we read with read.csv().
textFile <- "Variable,Georgia,California,Alaska,Alabama
population2018Estimate,10519475,39557045,737438,4887871
population2010EstimatedBase,9688709,37254523,710249,4780138
pctChange2010to2018,8.6,6.2,3.8,2.3
population2010Census,8676653,37253956,710231,4779736"
# load tidyverse libraries
library(tidyr)
library(dplyr)
# read the CSV text into a data frame
data <- read.csv(text = textFile, stringsAsFactors = FALSE)
# first gather to narrow format, then spread back to wide format
data %>%
  gather(., state, value, -Variable) %>%
  spread(Variable, value)
...and the results:
       state pctChange2010to2018 population2010Census population2010EstimatedBase population2018Estimate
1    Alabama                 2.3              4779736                     4780138                4887871
2     Alaska                 3.8               710231                      710249                 737438
3 California                 6.2             37253956                    37254523               39557045
4    Georgia                 8.6              8676653                     9688709               10519475
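For reference, on newer versions of tidyr (1.0 and later), the same reshape can be written with pivot_longer() and pivot_wider(); a minimal equivalent sketch using the same textFile string:
data <- read.csv(text = textFile, stringsAsFactors = FALSE)
data %>%
  pivot_longer(-Variable, names_to = "state") %>%          # gather the state columns into rows
  pivot_wider(names_from = Variable, values_from = value)  # spread the variables back out as columns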

Making a New R DataFrame from 2 Existing Ones

I have 2 data frames in R. One has 187 observations and one has 195. I need to create a new data frame consisting of only the 8 observations that are not common between the two. Data frame 1 (with 195 observations) is called merged. Data frame 2 (with 187 observations) is called merged2013. Both data frames have a column called Country.Code, and each observation has a unique code that distinguishes it from the others. How can I complete this task? Please list a function and explain it if possible!
Thank you!
Try using logical indexing. This returns the subset of rows of merged whose Country.Code values do not appear in merged2013:
merged[ !(merged$Country.Code %in% merged2013$Country.Code) , ]
Edited the names of the dataframes to match the question.
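A small self-contained sketch of the same idea, with made-up country codes (the values are purely illustrative):
merged     <- data.frame(Country.Code = c("USA", "CAN", "MEX", "BRA", "ARG"))
merged2013 <- data.frame(Country.Code = c("USA", "CAN", "MEX"))
# keep only the rows of merged whose Country.Code does not appear in merged2013
merged[!(merged$Country.Code %in% merged2013$Country.Code), , drop = FALSE]
#   Country.Code
# 4          BRA
# 5          ARG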

R - table of a feature in dataframe, but only if x occurrences

I have a data frame in R, and when I do something such as:
table(data$brand)
I get about a hundred factor levels (many with 0 after cleaning the data), and many with only 1 or 2 occurrences. I only care about the ones with more than 50 occurrences. Is there a way to get a table like that instead of reading through the long list?
We can subset the table:
tbl <- table(data$brand)
tbl[tbl > 50]
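A toy example, assuming a brand column with a handful of made-up levels:
data <- data.frame(brand = c(rep("acme", 60), rep("globex", 55), rep("initech", 3)))
tbl <- table(data$brand)
tbl[tbl > 50]
#   acme globex
#     60     55
Wrapping the result in sort(..., decreasing = TRUE) would list the most frequent brands first.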

Transpose/Reshape Data in R

I have a data set in a wide format, consisting of two rows: one with the variable names and one with the corresponding values. The variables represent characteristics of individuals from a sample of size 1000. For instance, I have 1000 variables for the size of each individual, then 1000 variables for the height, then 1000 variables for the weight, etc. Now I would like to run simple regressions (say weight on calorie consumption), and the only way I can think of doing this is to declare a vector that contains the 1000 observations of each variable, for instance:
regressor1=c(mydata$height0, mydata$height1, mydata$height2, mydata$height3, ... mydata$height1000)
But given that I have a few dozen variables and each containing 1000 observations this will become cumbersome. Is there a way to do this with a loop?
I have also thought about the reshape options in R, but this again would put me in a position where I have to type 1000 variable names a few dozen times.
Thank you for your help.
Here is how I would go about your issue. t() will transpose the data for you, turning many columns and few rows into many rows and few columns.
Note: t() also works on a matrix; I simply coerced to a data frame to show that the example will work with data like yours.
# many columns, 2 rows
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000))
# 2 columns, many rows
t(x)
Based on your comments you are looking to generate vectors.
If you have transposed (e.g. xt <- as.data.frame(t(x))), each variable is a column:
regressor1 <- xt[, 1]
regressor2 <- xt[, 2]
If you have not transposed, each variable is a row (wrap with as.numeric() to get a plain vector):
regressor1 <- as.numeric(x[1, ])
regressor2 <- as.numeric(x[2, ])
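Once the data are in the many-rows orientation, a regression is just a formula over the columns; a minimal sketch, assuming columns named height and weight after transposing (the names are illustrative, not part of the original data):
xt <- as.data.frame(t(x))            # 1000 rows, one column per variable
names(xt) <- c("height", "weight")   # assumed variable names for illustration
fit <- lm(weight ~ height, data = xt)
summary(fit)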

In R, how do I select rows of entries in one dataframe by identifiers from a second dataframe [duplicate]

Possible Duplicate:
In R, how do I subset a data.frame by values from another data.frame?
I have two data.frames. The first (df1) is a single column of 100 entries with header - "names". The second (df2) is a dataframe containing hundreds of columns of metadata for tens of thousands of entries. The first column of df2 also has the header "names".
I simply want to select all the metadata in df2 by the subset of names found in df1.
Please help this novice R user. Thank you!
You can subset the data.frame with %in%, but it can be slow if you have many thousands of names to look up.
I would recommend using data.table because it sorts the index columns and can do an almost instantaneous database join even with millions of records. Read the data.table documentation for more information.
Suppose you have a big data.frame and little data.frame:
library(data.table)
big <- data.frame(names=1:5, data=1:5)
small <- data.frame(names=c(1, 3, 6))
Make them into data.table objects and set the key column to be names.
big <- data.table(big, key='names')
small <- data.table(small, key='names')
Now perform the join. [] in data.table allows a data.table to be indexed by the key column of another data.table. In this case, we return the rows of big that match the names in small, and the data column will be NA where a name appears in small but not in big.
big[small]
# names data
# 1: 1 1
# 2: 3 3
# 3: 6 NA
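For comparison, the plain %in% subset mentioned at the top of this answer would look roughly like this on the same toy data (note it simply drops names that are missing from big rather than producing an NA row):
big_df   <- data.frame(names = 1:5, data = 1:5)
small_df <- data.frame(names = c(1, 3, 6))
# keep only the rows of big_df whose names appear in small_df
big_df[big_df$names %in% small_df$names, ]
#   names data
# 1     1    1
# 3     3    3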
