We have answer of your question!

100% solved queries, no empty question

Question: Unusual structure of a txt file


0

Having such a text file (example): https://drive.google.com/open?id=0B1vq9WjkqkvzTEVEUnlXMGVFa00

Original file has 65k rows. I need to upload it to R and make processable. I used the following functions:

  1. read.table - didn't work (R never returned any result)
  2. fread from data.table package - required a lot of manual preprocessing of the file and didn't work as needed as quotes broke the line and the file wasn't in the appropriate form)
  3. scan got a vector, transformation into matrix didn't bring needed results.

Desired form of the file is a regular data frame:

mydata <- structure(list(fieldName = structure(c(3L, 3L), .Label = c("description", 
    "scraped_manufacturer", "title"), class = "factor"), foreign_id = c(13389, 
    13389), is_single_product = structure(1:2, .Label = c("FALSE", 
    "TRUE"), class = "factor"), matched_manufacturers = c("Foden /manId: 76775", 
    "Caterpillar /manId: 74, Skogsjan-Caterpillar /manId: 10329"), 
        matched_products = c("", "C12 /modelId: 32774 /manId: 74"
        ), raw_string = c("CAT FODEN C-12 ENGINE", "CATERPILLAR C-12 ENGINE"
        ), pagesource = structure(c(84L, 84L), .Label = c("", "585e362f6b010083d6962041", 
        "585f270a300000c614b819ed", "585f84be6b0100c6ee962ab1", "585f84dc66010074efac42ca", 
        "585f875a6b0100c7ee963000", "585f878c66010074efac483e", "585f87ad66010075efac4880", 
        "585f88e06b0100b6ee96331c", "585f8b4566010074efac4fcb", "agriaffaires", 
        "apex-auctions", "arlington-plastics-machinery", "auctelia", 
        "auctions-international", "autogilles", "baestlein", "baupool", 
        "bavaria-swiss-ag", "big-iron", "big-machinery", "blackforxx", 
        "blue-group", "bpi-associates", "buk-baumaschinen", "cegema", 
        "christophbusch", "cjm-asset", "classified", "cnc-auction", 
        "cottrill-and-co", "daan", "de-vries", "dechow", "dimex-import-export", 
        "e-farm", "ebay", "ebay-de", "eberle-hald-gmbh", "eggers-landmaschinen", 
        "euro-auctions", "fabricating-machinery-corp", "fastline", 
        "ferwood", "fh-machinery", "first-machinery-auctions-limited", 
        "forklift-international", "ga-tec-gabelstaplertechnik", "gambtec", 
        "geiger", "german-graphics", "goindustry-dovebid", "graf", 
        "gruma-nutzfahrzeuge-gmbh", "hanselmann", "heinrich-kuper-gmbh", 
        "hooray-machinery", "imz-maschinen", "industrial-discount", 
        "ipr-petmachinery", "ironplanet", "ironplanet-com", "karl-guenter-wirths-gmbh", 
        "karner-dechow", "kurt-steiger", "kvd-auctions", "lagermaschinen", 
        "leinweber-landtechnik", "mach4metal", "machinefinder", "machinery-park", 
        "machineryzone", "maschinenbau-rehnen-gmbh", "mideast-equipment", 
        "mmtequipment", "oskar-broziat-maschinen", "perfection-global", 
        "perlick", "perry-videx", "pfeifer-machinery", "plustech-as", 
        "polboto-agri-sp-z-oo", "pressenhaas", "rc-tuxford-exports", 
        "resale", "restlos", "richter-friedewald-gmbh", "ritchie-bros", 
        "rock-and-dirt", "rogiers", "rs-auktionen", "stig-bindner", 
        "surplex", "technikboerse", "themar-trucks", "traktorpool", 
        "unilift", "vebim", "vertimac", "zeppelin-caterpillar", "zoll-auktion", 
        "zuern-gmbh"), class = "factor")), .Names = c("fieldName", 
    "foreign_id", "is_single_product", "matched_manufacturers", "matched_products", 
    "raw_string", "pagesource"), row.names = 1:2, class = "data.frame")

Any ideas of how to make it possible to work with the file?

Question author Hamsy | Source

Answer


1


Consider opening the text file in a software that can read RTF types. On Windows machines, Microsoft Word and the built-in Wordpad can read .rtf documents. And in doing so, a valid json shows in document (without markup content).

JSON text

Fortunately, R on Windows can connect to the MS Word Object Library using the RDCOMClient library where you extract text using the Document.Content property. Once you read in the json text, use the jsonlite library to migrate content to a dataframe:

library(RDCOMClient)
library(jsonlite)

# OPEN WORD APP
wrdApp = COMCreate("Word.Application")
wrdDoc = wrdApp$Documents()$Open("C:\Path\To\Data.txt")    
wrdtext = wrdDoc[['Content']]

# EXTRACT TEXT TO R VARIABLE
doc = wrdtext$Text()

# CLOSE APP
wrdDoc$Close(FALSE)
wrdApp$Quit()

# RELEASE RESOURCES
wrdtext <- wrdDoc <- wrdApp <- NULL
rm(wrdtext, wrdDoc, wrdApp)
gc()

# RAW DF: NAME / COLUMNS / VALUES LIST TYPES
rawdf <- fromJSON(doc)[[1]][[1]][[1]]

# FINAL DF: NORMALIZING VALUES WITH COL NAMES
finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
                               rawdf$columns[[1]])

Output

Final Dataframe


Alternative

Should you not have MS Word installed. Launch the CMD prompt and open Wordpad (built-in Windows app) with the command line and copy all content to a .json file (or right click text file and open with Wordpad). Do the counterpart of special app and terminal call if on another OS (Linux/Mac):

write "D:\Path\To\Data.txt"

After json file is saved, then in R run:

rawdf <- do.call(rbind, lapply(paste(reaadLines("C:\Path\To\Data.json", warn=FALSE),
                                      collapse=""), 
                                jsonlite::fromJSON))[[1]][[1]][[1]]

finaldf <- setNames(data.frame(rawdf$values, stringsAsFactors = FALSE),
                    rawdf$columns[[1]])
Answer author Parfait

Tickanswer.com is providing the only single recommended solution of the question Unusual structure of a txt file under the categories i.e r , . Our team of experts filter the best solution for you.

Related Search Queries:


You may also add your answer

Thanks for contributing an answer to Tick Answer!