R - Export large dataframe into CSV

Beginner here: I have a list (see screenshot) called Coins_list from which I want to export the second dataframe stored in it called data into a csv. When I use the code
write.csv(Coins_list$data, file = "Coins_list_full_data.csv")
I get a huge CSV with a bunch of numbers from the column named price, which apparently contains more dataframes, if I read the output correctly. How can I export this dataframe into CSV correctly, or at least display the data in the price column? See screenshot for more details.
EDIT: I was able to get the first four rows into CSV by using df2 <- Coins_list$data; write.csv(df2[1:4,], file="BTC_row.csv"). However, it now looks like R puts the prices of all four rows into a single c( ) and repeats it in each row. Any idea how to change that?

(I would post this as a comment, but I have too little reputation.)
Hey, for starters you could try to flatten the JSON by going one level deeper than the response list's $content and looking at what is inside the content with another $.
Otherwise, you could try getting data$price and see what pops up from there.
Something like this:
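symbols <- data$symbol
df <- NULL
# loop over every coin, not just the last index
for (i in seq_along(symbols)) {
  # if data$price[[i]] holds several values, symbol is recycled across them
  x <- data.frame(price = data$price[[i]], symbol = symbols[i])
  # append the new rows (rbind, not a join)
  df <- rbind(df, x)
}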
to get a dataframe with price and symbol. I don't know how the data is nested so I'm just guessing.
It would be helpful to know from where you got the data for reproducibility.
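To make the flattening idea concrete, here is a minimal sketch. It assumes, purely as a guess since the real structure of Coins_list$data is not shown, that price is a list column holding one numeric vector per coin. The toy coins data frame and the output file name are made up for the example.

```r
# Toy stand-in for Coins_list$data (hypothetical structure)
coins <- data.frame(symbol = c("BTC", "ETH"), stringsAsFactors = FALSE)
coins$price <- list(c(100, 110, 120), c(10, 11))

# One row per (symbol, price) pair: repeat each symbol once per price value
long <- data.frame(
  symbol = rep(coins$symbol, lengths(coins$price)),
  price = unlist(coins$price),
  stringsAsFactors = FALSE
)

# A flat data frame exports cleanly, without c( ) cells
out_file <- file.path(tempdir(), "coins_long.csv")
write.csv(long, file = out_file, row.names = FALSE)
```

If price turns out to hold data frames rather than vectors, the same pattern should still apply, but the unlist() step would need to be replaced by something that binds those data frames together.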


Importing data in R from Excel with information contained in header

As title says, I am trying to import data from Excel to R, where part of the information is contained in the header.
In a very simplified way, the Excel file I have looks like this:
GROUP;1234
MONTH;"Jan"
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
Total;;;147000
After reading in to R, it should be a "clean" dataset that looks like this.
GROUP;MONTH;PERSON;SEX;AGE;INCOME
1234;Jan;John;m;26;20000
1234;Jan;Michael;m;24;40000
1234;Jan;Phillip;m;25;15000
1234;Jan;Laura;f;27;72000
I have several files that look like this. The number of persons however varies in each file. The last line contains a summary that should be skipped. There might be empty lines between the list and summary line.
Any help is highly appreciated. Thank you very much.
Excel files can be read using readxl::read_excel().
One of its parameters is skip, which lets you skip a given number of rows.
For your data, you need to skip the first two lines that contain GROUP and MONTH.
You will get the data in the following format:
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
After this, you can add the GROUP and MONTH columns manually.
Thank you very much for your help. The hint from @Aurèle brought the missing puzzle piece. The solution I have now come up with is as follows:
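library(readxl)
# GROUP and MONTH sit in the second column of the first two rows
group <- read_excel("TEST1.xlsx", col_names = c("C1","GROUP"), n_max = 1)
group <- group[,2]
month <- read_excel("TEST1.xlsx", col_names = c("C1","MONTH"), skip = 1, n_max = 1)
month <- month[,2]
# skip the rows above the first person (adjust to your sheet layout)
data <- read_excel("TEST1.xlsx", col_names = c("NAME","SEX","AGE","INCOME"), skip = 4)
# drop empty lines and the Total row; AGE != NA always yields NA, so use is.na()
data <- data[!is.na(data$AGE),]
data <- cbind(data,group,month)
data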
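The same three steps (read the two header values, read the table, drop the summary row) can be tried without an Excel file. The sketch below is only an illustration: it uses base R's read.table() on the semicolon-separated sample from the question instead of read_excel(), and the variable names raw, people and clean are made up here.

```r
# Semicolon-separated copy of the sample sheet, including an empty line
raw <- c(
  "GROUP;1234",
  'MONTH;"Jan"',
  "PERSON;SEX;AGE;INCOME",
  "John;m;26;20000",
  "Michael;m;24;40000",
  "Phillip;m;25;15000",
  "Laura;f;27;72000",
  "",
  "Total;;;147000"
)

# Step 1: the header values are the second field of lines 1 and 2
group <- read.table(text = raw[1], sep = ";")[[2]]
month <- read.table(text = raw[2], sep = ";", stringsAsFactors = FALSE)[[2]]

# Step 2: skip the two header lines and read the table (blank lines are skipped)
people <- read.table(text = raw, sep = ";", skip = 2, header = TRUE,
                     stringsAsFactors = FALSE)

# Step 3: the Total row has no AGE, so filtering on AGE removes it
people <- people[!is.na(people$AGE), ]

# Step 4: attach GROUP and MONTH to every remaining row
clean <- data.frame(GROUP = group, MONTH = month, people,
                    stringsAsFactors = FALSE)
```

Wrapping these steps in a function that takes a file path would let you apply them to several files with lapply() and combine the results with rbind().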

Searching for target in Excel spreadsheet using R

As an R noob, I'm currently rather stumped by what is probably a rather trivial problem. I have data that looks like in the second image below, essentially a long sheet of rows with values in three columns. What I need is for a way to scan the sheet looking for particular combinations of values in the first and second column - combinations that are specified in a second spreadsheet of targets (see picture 1). When that particular combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
targets[[i]] <- c(targetsDb[i,1],targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of column 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
SID_IA[[n]] <- c(data[n,1],data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
for(i in targets){
if(SID_IA[[n]] %in% targets){
print(SID_IA[[n]], "in sentence" , data[n,1], "is ", data[n,3])
}else{
print(FALSE)
}
SID_IA[[n]] <- NULL
}
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.
OK.. I'm attempting an answer that should require minimum use of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows appear exactly the same in data and target, it is as simple as a merge:
merge(target,data,all = F)
If, on the other hand, you want to keep the data table and mark the target rows, the easiest way is to add an index column:
data$indx<- 1:nrow(data)
data
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"
data
This is like the original image you'd posted.
BTW, if you are on a graphical interface you can also use a file dialog to open data files. Check out file.choose().
(Posted on behalf of the OP).
Following from @R.S.'s suggestion that didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines, and then how to remove them from the original data, outputting both results. I'm leaving it here for future reference and considering this solved.
extracted <- merge(targets,data,all = F)
write.xlsx(extracted,output_file1)
combined <-rbind(data,extracted)
minus.target <- combined[!duplicated(combined,fromLast = FALSE)&!duplicated(combined,fromLast = TRUE),]
write.xlsx(minus.target,output_file2)
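For readers who want to try the merge-based extraction without the Excel files, here is a small sketch on toy data. The column names SID, IA and value are assumptions (the real sheets are not shown), and the removal step uses a pasted key rather than the rbind/duplicated approach above.

```r
# Toy data: two key columns and one value column (hypothetical names)
data <- data.frame(
  SID = c(1, 1, 2, 2, 3),
  IA = c("a", "b", "a", "b", "a"),
  value = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)
target <- data.frame(SID = c(1, 3), IA = c("b", "a"), stringsAsFactors = FALSE)

# merge() joins on the columns both data frames share (SID and IA),
# so only rows whose combination appears in target are kept
extracted <- merge(target, data)

# Rows of data whose key combination is NOT among the targets
data_key <- paste(data$SID, data$IA)
target_key <- paste(target$SID, target$IA)
remaining <- data[!(data_key %in% target_key), ]
```

Note that merge() sorts its result by the key columns, so the extracted rows may not come back in the order of the original sheet.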

R: subset() function altered character data into strange code

I read some data into R with read.xlsx() from the openxlsx package, and here's my code for reading the data:
data_all = read.xlsx(xlsxFile = paste0(path, EoLfileName), sheet = 1, detectDates = T, skipEmptyRows = F)
Now, when I access one name cell in my data, it prints the name as characters:
> data_all[1,'name']
[1] "76-ES+ADVIP-20G"
Now, let's say I want to subset out some rows based on a condition on another column:
data_sub = subset(data_all, !is.na(data_all$amount))
However, if I then print this subset data, I get:
> data_sub[1,'name']
[1] "A94198.10"
I've also tried subsetting with the following method:
data_sub = data_all[!is.na(data_all$amount),]
But I get the same thing: the expected output of "76-ES+ADVIP-20G" is turned into "A94198.10".
I've checked many times with mode() and str() for data_all$name and data_sub$name; both return character, so they are in the correct format.
Here's a link to sample data to play with:
https://drive.google.com/file/d/0BwIbultIWxeVY1VtdDU5NFp1Tkk/view?usp=sharing
Please help me! I am quite stuck, and I don't see other posts with a similar problem.
Why is this happening? Subsetting shouldn't change data formatting, correct?
Thank you in advance for your help!
Additional note (if it's helpful):
When I tried to debug, I noticed that when viewing data_all in RStudio, copying and pasting the name "76-ES+ADVIP-20G" into the filter bar finds nothing. I have to type "76-ES", and as soon as I type the next character, "+", the RStudio data view filter says "no matching records found".

R - combining lines from multiple CSV into a data frame

I have a folder with hundreds of CSV files each containing data for a particular postal code.
Each CSV file contains two columns and thousands of rows. Descriptors are in Column A, values are in Column B.
I need to extract two pieces of information from each file and create a new table or dataframe using the values in [Column A, Row 2] (which is the postal code) and [Column B, Row 1585] (which is the median income).
The end result should be a table/dataframe with two columns: one for postal code, the other for median income.
Any help or advice would be appreciated.
Disclaimer: this question is pretty vague. Next time, be sure to add a reproducible example that we can run on our machines. It will help you, the people answering your questions, and future users.
You might try something like:
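# full.names = TRUE returns paths that read.csv can open directly
files = list.files("~/Directory", full.names = TRUE)
my_df = NULL
for(i in seq_along(files)){
  # header = FALSE so that skip counts lines of the file directly
  row2 = read.csv(files[i], header = FALSE, skip = 1, nrows = 1)       # line 2
  row1585 = read.csv(files[i], header = FALSE, skip = 1584, nrows = 1) # line 1585
  # keep column A of line 2 and column B of line 1585
  my_df = rbind(my_df, data.frame(A = row2[[1]], B = row1585[[2]]))
}
my_df = my_df[,c("A","B")]
# Note on interpreting indexing syntax:
# Read this as "my_df is now (=) my_df such that ([) the columns (,)
# are only A and B (c("A", "B"))"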
You can use the list.files function to get the paths of all your files and then use read.csv and rbind in a for loop to create one data.frame.
Something like this:
direct <- list.files("directory_to_your_files", full.names = TRUE)
df <- NULL
for(i in seq_along(direct)){
  df <- rbind(df, read.csv(direct[i]))
}
So here is the code which does what I want it to do. If there are more elegant solutions, please feel free to point them out.
# set the working directory to where the data files are stored
setwd("/foo")
# count the files
files = list.files("/foo")
#create an empty dataframe and name the columns
dataMatrix=data.frame(matrix(c(rep(NA,times=2*length(files))),nrow=length(files)))
colnames(dataMatrix)=c("Postal Code", "Median Income")
# create a for loop to get the information in R2/C1 and R1585/C2 of each data file
# Data in R2/C1 is a string, but is interpreted as a number unless specifically declared a string
for(i in 1:length(files)) {
getData = read.csv(files[i],header=F)
dataMatrix[i,1]=toString(getData[2,1])
dataMatrix[i,2]=(getData[1585,2])
}
Thank you to all those who helped me figure this out, especially Nancy.
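As one possible alternative to the explicit loop, the extraction can be wrapped in a small function and applied to every file with lapply(). This is only a sketch: the function name extract_pair, the toy files and their contents are made up, and the demo uses row 3 instead of row 1585 so that the files stay tiny.

```r
# Read one file and return its postal code and median income as a one-row data frame
extract_pair <- function(path, code_row = 2, value_row = 1585) {
  # colClasses = "character" keeps postal codes from being converted to numbers
  rows <- read.csv(path, header = FALSE, colClasses = "character")
  data.frame(
    postal_code = rows[code_row, 1],
    median_income = as.numeric(rows[value_row, 2]),
    stringsAsFactors = FALSE
  )
}

# Toy files standing in for the real folder
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
writeLines(c("Descriptor,Value", "K1A0B1,", "Median income,52000"),
           file.path(demo_dir, "a.csv"))
writeLines(c("Descriptor,Value", "M5V2T6,", "Median income,61000"),
           file.path(demo_dir, "b.csv"))

# Apply the function to every CSV and stack the one-row results
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(files, extract_pair, value_row = 3))
```

For the real data you would call lapply(files, extract_pair) with the default value_row of 1585. Using full.names = TRUE also avoids the need for setwd().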

Feed new ID into script repeatedly

I searched for a solution to my question for a while but did not see one that I could get working. Basically I have the following situation:
I read a file into a data frame called df1 that has many ids (each id can be in the file 80-120 times), dates, and numerical data.
I have a script that does a bunch of calculations and then exports a csv file with the title as the classification I have created, an underscore, and the id like below. Each file only contains 1 unique id but is usually 80+ rows.
write.table(df,
file = paste0(unique(df$classification), "_", unique(df$id), ".csv"),
sep = ",", row.names = FALSE)
What I am hoping to do is, after I read in the file, get a unique list (I assume this would be a list?) of the id values, and then feed this into the rest of the script one value at a time. So essentially, I would take the first unique id in df1, feed it into the subset function, do a bunch of calculations, and then export the file. Move on to the second unique id, feed it into the subset, do a bunch of calculations, export the file. Rinse and repeat. This seems trivial but I have struggled to find a solution. Any help would be greatly appreciated!
I assume I can put a loop together prior to the line below and then have it loop through the entire script replacing the xxxxxxxxx with a new id each time?
df <- subset(df1, id == xxxxxxxxxxxxxxx)
If I understand your question correctly, you should be able to loop through like this:
for(i in unique(df1$id)){
df <- df1[df1$id == i,]
...
}
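Putting the loop together with the export step, a complete run might look like the sketch below. The toy df1, the placeholder calculation and the temporary output folder are assumptions made for illustration; the real calculations would go where the placeholder is.

```r
# Toy stand-in for df1 (hypothetical columns and values)
df1 <- data.frame(
  id = c(101, 101, 102, 102, 102),
  classification = c("A", "A", "B", "B", "B"),
  value = 1:5,
  stringsAsFactors = FALSE
)

out_dir <- tempfile("by_id")
dir.create(out_dir)

for (current_id in unique(df1$id)) {
  # Subset to the rows of one id
  df <- df1[df1$id == current_id, ]

  # Placeholder for the real calculations
  df$value_scaled <- df$value / max(df$value)

  # paste0() joins the pieces without inserting spaces into the file name
  out_file <- file.path(
    out_dir,
    paste0(unique(df$classification), "_", current_id, ".csv")
  )
  write.table(df, file = out_file, sep = ",", row.names = FALSE)
}

written <- sort(list.files(out_dir))
```

If the calculations grow, moving the loop body into a function and calling it with lapply(split(df1, df1$id), ...) is a common way to keep the script readable.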
